$\ell_1$-PENALIZED QUANTILE REGRESSION IN HIGH-DIMENSIONAL SPARSE MODELS

By Alexandre Belloni and Victor Chernozhukov

Duke University and Massachusetts Institute of Technology

We consider median regression and, more generally, quantile regression in high-dimensional
sparse models. In these models the overall number of regressors $p$ is very large, possibly
larger than the sample size $n$, but only $s$ of these regressors have non-zero impact on the
conditional quantile of the response variable, where $s$ grows slower than $n$. Since in this
case the ordinary quantile regression is not consistent, we consider quantile regression
penalized by the $\ell_1$-norm of coefficients ($\ell_1$-QR). First, we show that $\ell_1$-QR
is consistent at the rate $\sqrt{s/n}\sqrt{\log p}$, which is close to the oracle rate
$\sqrt{s/n}$, achievable when the minimal true model is known. The overall number of
regressors $p$ affects the rate only through the $\log p$ factor, thus allowing nearly
exponential growth in the number of zero-impact regressors. The rate result holds under
relatively weak conditions, requiring that $s/n$ converges to zero at a super-logarithmic
speed and that the regularization parameter satisfies certain theoretical constraints. Second,
we propose a pivotal, data-driven choice of the regularization parameter and show that it
satisfies these theoretical constraints. Third, we show that $\ell_1$-QR correctly selects the
true minimal model as a valid submodel, when the non-zero coefficients of the true model are
well separated from zero. We also show that the number of non-zero coefficients in $\ell_1$-QR
is of the same stochastic order as $s$, the number of non-zero coefficients in the minimal true
model. Fourth, we analyze the rate of convergence of a two-step estimator that applies
ordinary quantile regression to the selected model. Fifth, we evaluate the performance of
$\ell_1$-QR in a Monte Carlo experiment, and provide an application to the analysis of
international economic growth.

1. Introduction. Quantile regression is an important statistical method for analyzing
the impact of regressors on the conditional distribution of a response variable (cf. Laplace
[22], Koenker and Bassett [20]). In particular, it captures the heterogeneity of the impact
of regressors on the different parts of the distribution [7], exhibits robustness to outliers
[19], has excellent computational properties [29], and has a wide applicability [19]. The
asymptotic theory for quantile regression is well-developed under both a fixed number of
regressors and an increasing number of regressors. The asymptotic theory under a fixed number
of regressors is given by Koenker and Bassett [20], Portnoy [28], Gutenbrunner and Jureckova
[14], Knight [17], Chernozhukov [9] and others. The asymptotic theory under an increasing
number of regressors is given in He and Shao [15] and Belloni and Chernozhukov [4, 5],
covering the case where the number of regressors $p$ is negligible relative to the sample size
$n$ ($p = o(n)$).

In this paper, we consider quantile regression in high-dimensional sparse models (HDSMs).
In such models, the overall number of regressors $p$ is very large, possibly much larger than
the sample size $n$. However, the number $s$ of significant regressors (those having a non-zero
impact on the response variable) is smaller than the sample size, that is, $s = o(n)$. The
HDSMs ([8], [26]) have emerged to deal with many new applications, arising in biometrics,
signal processing, machine learning, econometrics, and other areas of data analysis, where
high-dimensional data sets have become widely available.

A number of papers began to investigate estimation of HDSMs, primarily focusing on
penalized mean regression, with the $\ell_1$-norm acting as a penalty function. Candes and Tao
[8] demonstrated that, remarkably, an estimator, called the Dantzig selector, achieves the rate
$\sqrt{s/n}\sqrt{\log p}$, which is very close to the oracle rate $\sqrt{s/n}$ obtainable when
the significant regressors are known. Thus the estimator can be consistent even under very
rapid, nearly exponential growth in the total number of regressors $p$. Meinshausen and Yu [26]
and Zhang and Huang [39] demonstrated similar striking results for the $\ell_1$-penalized least
squares proposed by Tibshirani [35]. van der Geer [37] derived valuable finite sample bounds on
empirical risk for $\ell_1$-penalized estimators in generalized linear models. Fan and Lv [11]
used screening and derived asymptotic results under even weaker conditions on $p$. There were
many other interesting developments, which we shall not review here.

Our paper's contribution is to develop, within the HDSM framework, a set of results on
model selection and rates of convergence for quantile regression. Since ordinary quantile
regression is not consistent in HDSMs, we consider quantile regression penalized by the
$\ell_1$-norm of parameter coefficients. We show that the $\ell_1$-penalized quantile
regression is consistent at the rate $\sqrt{s/n}\sqrt{\log p}$, which is close to the oracle
rate $\sqrt{s/n}$ achievable when the true minimal model is known. In order to make the
penalized estimator practical, we propose a pivotal, data-driven choice of the regularization
parameter, and show that this choice leads to the same sharp convergence rate. Further, we show
that the penalized quantile regression correctly selects the true minimal model as a valid
submodel, when the non-zero coefficients of the true model are well separated from zero. We
also analyze a two-step estimator that applies standard quantile regression to the selected
model and aims at reducing the bias of the penalized quantile estimator. We illustrate the use
of the penalized and post-penalized estimators with a Monte Carlo experiment and an
international economic growth example. Thus, our results contribute to the literature on HDSMs
by examining a new class of problems. Moreover, our proof strategy, developed to cope with
non-linearities and non-smoothness of quantile regression, may be of interest in other
M-estimation problems. (We provide more detailed comparisons to the literature in Section 2.)

Finally, let us comment on the role of computational considerations in our analysis.
The choice of the $\ell_1$-penalty function arises from considering a tradeoff between
statistical efficiency and computational efficiency, with the latter being of particular
importance in high-dimensional applications. Indeed, in model selection problems, the
statistical efficiency criterion favors the use of the $\ell_0$-penalty functions (Akaike [1]
and Schwarz [32]), where the $\ell_0$-penalty counts the number of non-zero components of a
parameter vector. However, the computational efficiency criterion favors the use of convex
penalty functions. Indeed, convex penalty functions lead to efficient convex programming
problems ([27]); in contrast, the $\ell_0$-penalty functions lead to inefficient combinatorial
search problems, plagued by the computational curse of dimensionality. Precisely because it is
a convex function that is closest to the $\ell_0$-penalty (e.g. [30]), the $\ell_1$-penalty has
emerged to play a central role in HDSMs, in general (e.g. [25]), and in our analysis, in
particular. In other words, the use of the $\ell_1$-penalty takes us close to performing the
most effective model selection, while respecting the computational efficiency constraint.

We organize the rest of the paper as follows. In Section 2, we introduce the problem and
some simple primitive assumptions D.1-D.4, and propose pivotal choices for the regularization
parameter. We also describe our key results under D.1-D.4, and provide detailed
comparisons with the literature. In Section 3, we develop the main results under conditions
E.1-E.5, which are implied by D.1-D.4, and also hold much more generally. Section 4
analyzes the pivotal choice of the penalization parameter. In Section 5, we carry out a
computational experiment and provide an application to an international growth example. In
Section 6, we provide conclusions and discuss possible extensions. In Appendix A, we verify
that conditions E.1-E.5 are implied by conditions D.1-D.4 and also hold more generally.

1.1. Notation. In what follows, we implicitly index all parameter values by the sample
size $n$, but we omit the index whenever this does not cause confusion. We carry out all of
the asymptotic analysis as $n \to \infty$. We use the notation $a \lesssim b$ to denote that
$a = O(b)$, that is, $a \le cb$ for all sufficiently large $n$, for some constant $c > 0$ that
does not depend on $n$, and we use $a \lesssim_P b$ to denote that $a = O_P(b)$; we use
$a \sim b$ to denote $a \lesssim b \lesssim a$ and $a \sim_P b$ to denote
$a \lesssim_P b \lesssim_P a$. We also use the notation $a \vee b = \max\{a, b\}$ and
$a \wedge b = \min\{a, b\}$. We denote the $\ell_2$-norm by $\|\cdot\|$, the $\ell_1$-norm by
$\|\cdot\|_1$, the $\ell_\infty$-norm by $\|\cdot\|_\infty$, and the $\ell_0$-"norm" by
$\|\cdot\|_0$.

2.  Basic  Settings,  the  Estimator,  and  Overview  of  Results.  In  this  section, 
we  formulate  the  setting,  the  estimator,  and  state  primitive  regularity  conditions.  We  also 
provide an overview of the main results.

2.1. Basic Setting. The set-up of interest corresponds to a parametric quantile regression
model, where the dimension $p$ of the underlying model increases with the sample size $n$.
Namely, we consider a response variable $y$ and $p$-dimensional covariates $x$ such that the
$u$-th conditional quantile function of $y$ given $x$ is given by

(2.1)  $Q_{y|x}(u) = x'\beta(u), \quad \beta(u) \in \mathbb{R}^p.$

We consider the case where the dimension $p$ of the model is large, possibly much larger than
the available sample size $n$, but the true model $\beta(u)$ is sparse, having only
$s = s(u) < p$ non-zero components. Throughout the paper the quantile index $u \in (0, 1)$ is
fixed.

The population coefficient $\beta(u)$ is known to be a minimizer of the criterion function

(2.2)  $Q_u(\beta) = E[\rho_u(y - x'\beta)],$

where $\rho_u(t) = (u - 1\{t \le 0\})t$ is the asymmetric absolute deviation function [20].
Given a random sample $(y_1, x_1), \ldots, (y_n, x_n)$, the quantile regression estimator
$\hat\beta(u)$ of $\beta(u)$ is defined as a minimizer of

(2.3)  $\hat Q_u(\beta) = E_n[\rho_u(y_i - x_i'\beta)],$

where $E_n[f(y_i, x_i)] := n^{-1}\sum_{i=1}^n f(y_i, x_i)$ denotes the empirical expectation of
a function $f$ in the given sample.
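
As a minimal sketch of the definitions in (2.2) and (2.3), the following Python code evaluates the check function $\rho_u$ and the empirical criterion $\hat Q_u(\beta)$ on a toy sample. The function names `check_loss` and `empirical_objective` and the data are illustrative assumptions, not part of the paper.

```python
def check_loss(t, u):
    """Asymmetric absolute deviation rho_u(t) = (u - 1{t <= 0}) * t."""
    return (u - (1.0 if t <= 0 else 0.0)) * t


def empirical_objective(beta, X, y, u):
    """Empirical criterion (2.3): the sample average of rho_u(y_i - x_i' beta)."""
    total = 0.0
    for x_i, y_i in zip(X, y):
        fitted = sum(b * x for b, x in zip(beta, x_i))
        total += check_loss(y_i - fitted, u)
    return total / len(y)


# Toy data (hypothetical): an intercept plus one regressor.
X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
y = [0.5, 1.0, 2.5, 2.0]

# The same candidate line is scored differently at different quantile indices,
# because positive residuals are weighted by u and negative ones by 1 - u.
median_fit = empirical_objective([0.5, 0.5], X, y, u=0.5)
upper_fit = empirical_objective([0.5, 0.5], X, y, u=0.9)
```
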
In the high-dimensional settings, particularly when $p > n$, quantile regression is generally
not consistent, which motivates the use of penalization in order to remove all or at least
nearly all regressors whose population coefficients are zero, thereby possibly restoring
consistency. The penalization that has been proven to be quite useful in least squares
settings is the $\ell_1$-penalty leading to the lasso estimator [35].

2.2. The Choice of Estimator, Linear Programming Formulation, and Its Dual. The
$\ell_1$-penalized quantile regression estimator $\hat\beta(u)$ is a solution to the following
optimization problem:

(2.4)  $\min_{\beta \in \mathbb{R}^p} \hat Q_u(\beta) + \frac{\lambda}{n}\sum_{j=1}^p |\beta_j|.$

When the solution is not unique, we define $\hat\beta(u)$ as a basic solution having the
minimal number of non-zero components. The criterion function in (2.4) is the sum of the
criterion function (2.3) and a penalty function given by a scaled $\ell_1$-norm of the
parameter vector. This $\ell_1$-penalized quantile regression or quantile regression lasso has
been considered by Knight and Fu [18] under the small (fixed) $p$ asymptotics.

For computational purposes, it is important to note that the penalized quantile regression
problem (2.4) is equivalent to the following linear programming problem:

(2.5)  $\min_{(\xi^+, \xi^-, \beta^+, \beta^-) \in \mathbb{R}_+^{2n+2p}} E_n[u\xi_i^+ + (1-u)\xi_i^-] + \frac{\lambda}{n}\sum_{j=1}^p (\beta_j^+ + \beta_j^-)$
       subject to $\xi_i^+ - \xi_i^- = y_i - x_i'(\beta^+ - \beta^-), \quad i = 1, \ldots, n.$

The problem minimizes the sum of two terms: the scaled $\ell_1$-norm of the parameter, written
in terms of the positive parts $\beta_j^+$ and negative parts $\beta_j^-$ of
$\beta_j = \beta_j^+ - \beta_j^-$, and an average of the asymmetrically weighted residuals
$\xi_i^+$ and $\xi_i^-$. The linear programming formulation (2.5) is useful for computation of
the estimator, particularly in high-dimensional applications. There are a number of efficient,
that is, polynomial time, algorithms for the linear programming problem (2.5). Using these
algorithms, one can compute the estimator (2.4) efficiently, avoiding the computational
curse of dimensionality.
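
The linear program (2.5) can be handed to any general-purpose LP solver. The sketch below is one possible way to do this, assuming NumPy and SciPy are available and using `scipy.optimize.linprog` with the HiGHS backend; the function names, the simulated design, and the value `lam=10.0` are illustrative assumptions rather than choices made in the paper.

```python
import numpy as np
from scipy.optimize import linprog


def l1_penalized_qr(X, y, u, lam):
    """Solve the LP (2.5); returns (beta_hat, optimal LP value)."""
    n, p = X.shape
    # Variable order: xi_plus (n), xi_minus (n), beta_plus (p), beta_minus (p).
    cost = np.concatenate([
        np.full(n, u / n),          # u * E_n[xi_plus]
        np.full(n, (1.0 - u) / n),  # (1 - u) * E_n[xi_minus]
        np.full(2 * p, lam / n),    # (lam / n) * sum_j (beta_plus_j + beta_minus_j)
    ])
    # Equality constraints: xi_plus - xi_minus + X beta_plus - X beta_minus = y.
    A_eq = np.hstack([np.eye(n), -np.eye(n), X, -X])
    res = linprog(cost, A_eq=A_eq, b_eq=y, bounds=(0, None), method="highs")
    if not res.success:
        raise RuntimeError(res.message)
    beta = res.x[2 * n:2 * n + p] - res.x[2 * n + p:]
    return beta, res.fun


def penalized_objective(beta, X, y, u, lam):
    """Criterion (2.4): E_n[rho_u(y_i - x_i' beta)] + (lam / n) * ||beta||_1."""
    resid = y - X @ beta
    check = np.where(resid <= 0, (u - 1.0) * resid, u * resid)
    return check.mean() + lam / len(y) * np.abs(beta).sum()


# Simulated sparse median regression (all settings here are illustrative).
rng = np.random.default_rng(0)
n, p = 60, 20
X = np.column_stack([np.ones(n), rng.standard_normal((n, p - 1))])
beta_true = np.zeros(p)
beta_true[:3] = [1.0, 2.0, -1.5]
y = X @ beta_true + rng.standard_normal(n)

beta_hat, lp_value = l1_penalized_qr(X, y, u=0.5, lam=10.0)
direct_value = penalized_objective(beta_hat, X, y, u=0.5, lam=10.0)
```

At an optimum the LP value and the criterion (2.4) evaluated at the recovered coefficients should agree up to solver tolerance, which gives a simple sanity check on the formulation. The dense constraint matrix is adequate for a small illustration; a sparse representation would be preferable when $n$ and $p$ are large.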

Furthermore, for both computational and theoretical purposes, it is important to note
that the primal problem (2.5) has the following dual problem:

(2.6)  $\max_{a \in \mathbb{R}^n} E_n[y_i a_i]$
       subject to $|E_n[x_{ij} a_i]| \le \frac{\lambda}{n}, \quad j = 1, \ldots, p,$
       $(u - 1) \le a_i \le u, \quad i = 1, \ldots, n.$

The dual problem maximizes the correlation between the response variable and the rank
scores subject to the condition requiring the rank scores to be approximately uncorrelated
with the regressors. This condition is reasonable, since the true rank scores, defined as
$a_i^*(u) = u - 1\{y_i \le x_i'\beta(u)\}$, should be independent of the regressors $x_i$. This
follows because by (2.1) the event $\{y_i \le x_i'\beta(u)\}$ is equivalent to the event
$\{u_i \le u\}$, for a standard uniformly distributed variable $u_i$ which is independent of
$x_i$.

Since both primal and dual problems are feasible, by strong duality for linear programming
the optimal value of (2.6) equals the optimal value of (2.4) (see, for example, Bertsimas
and Tsitsiklis [6]). The optimal solution to the dual problem plays an important role
in our analysis, helping us control the sparseness of the penalized estimator $\hat\beta(u)$
as well as choose the penalization parameter $\lambda$. Of course, the optimal solution to the
dual problem (2.6) also plays an important role in the non-penalized case, with $\lambda = 0$,
yielding the regression generalization of Hajek-Sidak rank scores (Gutenbrunner and Jureckova
[14]).
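
To make the dual constraints concrete, the following sketch checks whether a candidate vector of rank scores is feasible for (2.6) and compares its dual value with the primal criterion (2.4) at $\beta = 0$, where weak duality guarantees that the dual value cannot exceed the primal one. The helper names, the toy data, and the candidate vector `a` are hypothetical.

```python
def dual_objective(a, y):
    """Dual criterion E_n[y_i a_i] from (2.6)."""
    return sum(a_i * y_i for a_i, y_i in zip(a, y)) / len(y)


def is_dual_feasible(a, X, u, lam, tol=1e-12):
    """Check the box and near-orthogonality constraints of (2.6)."""
    n, p = len(X), len(X[0])
    box_ok = all(u - 1.0 - tol <= a_i <= u + tol for a_i in a)
    corr = [abs(sum(X[i][j] * a[i] for i in range(n)) / n) for j in range(p)]
    return box_ok and max(corr) <= lam / n + tol


def primal_at_zero(y, u):
    """Value of (2.4) at beta = 0, where the penalty term vanishes."""
    return sum((u - (1.0 if y_i <= 0 else 0.0)) * y_i for y_i in y) / len(y)


X = [[1.0, -1.0], [1.0, 0.0], [1.0, 1.0], [1.0, 2.0]]
y = [0.2, -0.3, 1.4, 1.1]
a = [0.5, -0.5, 0.5, -0.5]  # hypothetical candidate rank scores

# For this candidate max_j |E_n[x_ij a_i]| = 0.25, so feasibility needs lam / n >= 0.25.
feasible_large = is_dual_feasible(a, X, u=0.5, lam=1.0)
feasible_small = is_dual_feasible(a, X, u=0.5, lam=0.5)
gap = primal_at_zero(y, 0.5) - dual_objective(a, y)
```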

Another potential approach worth considering is the Dantzig selector approach of Candes
and Tao [8], proposed in the context of mean regression. We can extend this approach to
quantile regression by defining the estimator as a solution to the following problem:

(2.7)  $\min_{\beta \in \mathbb{R}^p} \sum_{j=1}^p |\beta_j| : \|S_u(\beta)\|_\infty \le \lambda,$

where $\lambda$ is a penalization parameter, and $S_u$ is a subgradient of the quantile
regression objective function $\hat Q_u(\beta)$:

(2.8)  $S_u(\beta) = E_n[(1\{y_i \le x_i'\beta\} - u)x_i].$

The estimator (2.7) minimizes the $\ell_1$-norm of the coefficients subject to a
goodness-of-fit constraint.

On computational grounds, we prefer the $\ell_1$-penalized estimator (2.4) over the Dantzig
selector estimator (2.7). The reason is that the subgradient $S_u$ in (2.8) is a piecewise
constant function of the parameters, leading to a serious difficulty in computing the
estimator (2.7). In particular, the problem (2.7) can be recast as a mixed integer programming
problem with $n$ binary variables, for which (generally) there is no known polynomial time
algorithm. (In sharp contrast, in the mean regression case the subgradient is a linear
function, $S(\beta) = E_n[(y_i - x_i'\beta)x_i]$, corresponding to the objective function
$Q(\beta) = E_n[(y_i - x_i'\beta)^2]/2$. Accordingly, in the mean regression case, the
optimization problem can be recast as a linear programming problem, for which there are
polynomial time algorithms.)
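
The piecewise constant behavior of $S_u$ can be seen on a small example. In the sketch below (toy data and function names are illustrative assumptions), a tiny move in $\beta$ leaves every indicator $1\{y_i \le x_i'\beta\}$ unchanged and hence leaves $S_u(\beta)$ unchanged, while a larger move flips an indicator and makes $S_u$ jump.

```python
def subgradient(beta, X, y, u):
    """S_u(beta) = E_n[(1{y_i <= x_i' beta} - u) x_i] from (2.8)."""
    n, p = len(y), len(beta)
    s = [0.0] * p
    for x_i, y_i in zip(X, y):
        fitted = sum(b * x for b, x in zip(beta, x_i))
        weight = (1.0 if y_i <= fitted else 0.0) - u
        for j in range(p):
            s[j] += weight * x_i[j] / n
    return s


def sup_norm(v):
    """The sup-norm used in the constraint of (2.7)."""
    return max(abs(c) for c in v)


X = [[1.0, -1.0], [1.0, 0.0], [1.0, 1.0], [1.0, 2.0]]
y = [0.2, -0.3, 1.4, 1.1]

# Two nearby parameter values share the same indicator pattern.
s_a = subgradient([0.0, 0.5], X, y, u=0.5)
s_b = subgradient([0.01, 0.5], X, y, u=0.5)
# A larger move flips the indicator of the last observation.
s_c = subgradient([0.5, 0.5], X, y, u=0.5)
```

Because the constraint in (2.7) depends on $\beta$ only through these indicator patterns, the feasible set is a union of polyhedral cells rather than a single convex set, which is the source of the computational difficulty described above.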
Another idea for formulating a Dantzig type estimator for quantile regression would be
to minimize the $\ell_1$-norm of the coefficients subject to a convex goodness-of-fit
constraint, namely

(2.9)  $\min_{\beta \in \mathbb{R}^p} \sum_{j=1}^p |\beta_j| : \hat Q_u(\beta) \le \gamma.$

Since the constraint set $\{\beta : \hat Q_u(\beta) \le \gamma\}$ is piecewise linear and
convex, this problem is equivalent to a linear programming problem. Of course, this is hardly
a surprise, since this problem is equivalent to an $\ell_1$-penalized quantile regression
problem (2.4) that we started with in the first place. Indeed, for every feasible choice of
$\gamma$ in (2.9) there is a feasible choice of $\lambda$ that makes the solutions to (2.9) and
to (2.4) identical. To see this, fix a $\gamma$ and let $\kappa$ denote the optimal value of
the Lagrange multiplier for the constraint $\hat Q_u(\beta) \le \gamma$; then the problem (2.9)
is equivalent to $\min_{\beta \in \mathbb{R}^p} \sum_{j=1}^p |\beta_j| + \kappa(\hat Q_u(\beta) - \gamma)$,
which is then equivalent to the original problem (2.4) with $\lambda = n/\kappa$. Therefore it
suffices to focus our analysis on the original problem.

2.3. The Choice of the Regularization Parameter. Here we propose a pivotal, data-driven
choice for the regularization parameter value $\lambda$. We shall verify in Section 4 that such
a choice will agree with our theoretical choice of $\lambda$ maximizing the speed of
convergence of the penalized estimate to the true parameter value.

Because the objective function in the primal problem (2.4) is not pivotal in either small or
large samples, finding a pivotal $\lambda$ appears to be difficult a priori. However, instead
of looking at the primal problem, let us look at its linear programming dual (2.6), which
requires that

(2.10)  $|E_n[x_{ij} a_i]| \le \frac{\lambda}{n}$ for all $j = 1, \ldots, p \iff \|E_n[x_i a_i]\|_\infty \le \frac{\lambda}{n}.$

This restriction requires that potential rank scores must be approximately uncorrelated
with regressors. It then makes sense to select $\lambda$ so that the true rank scores

$a_i^*(u) = u - 1\{y_i \le x_i'\beta(u)\}$ for $i = 1, \ldots, n$

satisfy this constraint. That is, we can potentially set $\lambda = \Lambda_n$, where

(2.11)  $\Lambda_n = n\|E_n[x_i a_i^*(u)]\|_\infty.$

Of course, since we do not observe the true rank scores, this choice is not available to us.
The key observation is that the finite sample distribution of $\Lambda_n$ is pivotal
conditional on the regressors $x_1, \ldots, x_n$. We know that rank scores can be represented
almost surely as

$a_i^*(u) = u - 1\{u_i \le u\}$, for $i = 1, \ldots, n$,

where $u_1, \ldots, u_n$ are i.i.d. uniform $(0, 1)$ random variables, independently
distributed from the regressors $x_1, \ldots, x_n$. Thus, we have

(2.12)  $\Lambda_n = n\|E_n[x_i(u - 1\{u_i \le u\})]\|_\infty,$

which has a known distribution conditional on $x_1, \ldots, x_n$. Therefore we can use the tail
quantiles of $\Lambda_n$ as our choice for $\lambda$. In particular, we set
$\lambda = \lambda(x_1, \ldots, x_n)$ as the $1 - \alpha_n$ quantile of $\Lambda_n$:

(2.13)  $\lambda = \inf\{c : P(\Lambda_n \le c \mid x_1, \ldots, x_n) \ge 1 - \alpha_n\},$

where $\alpha_n \searrow 0$ at some rate to be determined below.

Finally, let us note that we can also derive the pivotal quantity $\Lambda_n$, and thus also
our choice of the regularization parameter $\lambda$, from the subgradient characterization of
optimality for the primal problem (2.4).
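
Since the conditional distribution of $\Lambda_n$ in (2.12) depends only on the observed regressors and on i.i.d. uniform draws, the quantile in (2.13) can be approximated by simulation. The following is a minimal sketch under stated assumptions: the function name `simulate_penalty`, the number of simulations `n_sim`, the seeds, and the levels `alpha` are all illustrative choices, and the paper's own rate for $\alpha_n$ is not reproduced here.

```python
import math
import random


def simulate_penalty(X, u, alpha, n_sim=500, seed=0):
    """Monte Carlo approximation to the (1 - alpha)-quantile of Lambda_n given X."""
    rng = random.Random(seed)
    p = len(X[0])
    draws = []
    for _ in range(n_sim):
        sums = [0.0] * p
        for x_i in X:
            # Rank score a_i = u - 1{u_i <= u} with u_i ~ Uniform(0, 1).
            a_i = u - (1.0 if rng.random() <= u else 0.0)
            for j in range(p):
                sums[j] += x_i[j] * a_i
        # Lambda_n = n * ||E_n[x_i a_i]||_inf = max_j |sum_i x_ij a_i|.
        draws.append(max(abs(v) for v in sums))
    draws.sort()
    # Smallest simulated c with empirical P(Lambda_n <= c) >= 1 - alpha, as in (2.13).
    k = max(0, math.ceil((1.0 - alpha) * n_sim) - 1)
    return draws[k]


# Illustrative design: an intercept plus standard normal regressors.
design_rng = random.Random(42)
n, p = 40, 8
X = [[1.0] + [design_rng.gauss(0.0, 1.0) for _ in range(p - 1)] for _ in range(n)]

lam_10 = simulate_penalty(X, u=0.5, alpha=0.10)
lam_01 = simulate_penalty(X, u=0.5, alpha=0.01)
```

No knowledge of the error distribution or of $\beta(u)$ enters the simulation, which is what makes the choice pivotal. A smaller `alpha` yields a larger penalty level.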

2.4. Primitive Conditions. We follow Huber's framework of high-dimensional parameters
[16], which formally consists of a sequence of models with parameter dimension $p = p_n$
tending to infinity as the sample size $n$ grows to infinity. Thus, the parameters of the
models, the parameter space, and the parameter dimension are all indexed by the sample size
$n$. However, following Huber's convention, we will omit the index $n$ whenever this does not
cause confusion. Let us consider the following set of conditions:

D.1. Sampling. Data $(y_i, x_i')'$, $i = 1, \ldots, n$, are an i.i.d. sequence of real
$(1 + p)$-vectors, with the conditional $u$-quantile function given by (2.1), and with the
first component of $x_i$ equal to one.

D.2. Sparseness of the True Model. The number of non-zero components of $\beta(u)$ is
bounded by $1 \le s = s_n \le n/\log(n \vee p)$.

D.3. Smooth Conditional Density. The conditional density $f_{y_i|x_i}(y|x)$ and its derivative
$\frac{\partial}{\partial y} f_{y_i|x_i}(y|x)$ are bounded above uniformly in $y$ and $x$
ranging over supports of $y_i$ and $x_i$, and uniformly in $n$.

D.4. Identifiability in Population and Well-Behaved Regressors. Eigenvalues of the population
design matrix $E[x_i x_i']$ are bounded above and away from zero, and
$\sup_{\|\alpha\| = 1} E[|x_i'\alpha|^3]$ is bounded above, uniformly in $n$. The conditional
density evaluated at the conditional quantile, $f_{y_i|x_i}(x'\beta(u)|x)$, is bounded away
from zero, uniformly in $x$ ranging over the support of $x_i$, and uniformly in $n$.

The conditions D.1-D.4 stated above are a set of simple conditions that ensure that the
high-level conditions developed in Section 3 hold. These conditions allow us to demonstrate
the general applicability of our results and straightforwardly compare to other results in
the literature. In particular, condition D.1 imposes random sampling on the data, which is
a conventional assumption in asymptotic statistics (e.g. [38]). Condition D.2 requires that
the effective dimension of the true model is smaller than the sample size. Condition D.3
imposes some smoothness on the conditional distribution of the response variable. Condition
D.4 requires the population design matrix to be uniformly non-singular and the regressors'
moments to be well-behaved.

Further, let $\phi(k)$ be the maximal $k$-sparse eigenvalue of the empirical design matrix
$E_n[x_i x_i']$, that is,

(2.14)  $\phi(k) = \sup_{\|\alpha\| \le 1, \|\alpha\|_0 \le k} E_n[(\alpha'x_i)^2].$

Following Meinshausen and Yu [26], we will state our general results on convergence rates of
the penalized estimator in terms of the maximal sparse eigenvalue $\phi(m_0)$. Meinshausen and
Yu [26] worked with $m_0 = n \wedge p$ as an initial upper bound on the zero norm of the
penalized estimator. In this paper we can work with a smaller $m_0$; in particular, under
D.1-D.4, we can work with

$m_0 = p \wedge (n/\log(n \vee p)),$

as this provides a valid initial bound on the zero norm of our penalized estimator under a
suitable choice of the penalization parameter.
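
For very small designs, the definition (2.14) can be evaluated by brute force: for each support of size $k$, take the largest eigenvalue of the corresponding principal submatrix of $E_n[x_i x_i']$, and maximize over supports. The sketch below does this in plain Python. The function names are hypothetical, the enumeration is exponential in $p$ and so only meant for illustration, and the power iteration assumes that its fixed start vector is not orthogonal to the top eigenvector.

```python
from itertools import combinations


def gram(X):
    """Empirical design matrix E_n[x_i x_i']."""
    n, p = len(X), len(X[0])
    return [[sum(row[a] * row[b] for row in X) / n for b in range(p)] for a in range(p)]


def largest_eigenvalue(M, iters=500):
    """Power iteration for a symmetric positive semi-definite matrix M."""
    k = len(M)
    v = [1.0 / (a + 1) for a in range(k)]  # generic start vector (an assumption)
    lam = 0.0
    for _ in range(iters):
        w = [sum(M[a][b] * v[b] for b in range(k)) for a in range(k)]
        norm = sum(c * c for c in w) ** 0.5
        if norm == 0.0:
            return lam
        v = [c / norm for c in w]
        # Rayleigh quotient v' M v for the unit vector v.
        lam = sum(v[a] * sum(M[a][b] * v[b] for b in range(k)) for a in range(k))
    return lam


def sparse_eigenvalue(X, k):
    """phi(k): sup of E_n[(alpha' x_i)^2] over unit vectors with at most k non-zeros."""
    G = gram(X)
    p = len(G)
    best = 0.0
    # Supports of size exactly min(k, p) suffice, since enlarging a support
    # cannot decrease the largest eigenvalue of the principal submatrix.
    for support in combinations(range(p), min(k, p)):
        sub = [[G[a][b] for b in support] for a in support]
        best = max(best, largest_eigenvalue(sub))
    return best


# Toy design whose empirical design matrix is [[1, 0, 0], [0, 1, 1], [0, 1, 2]].
X = [[1.0, 1.0, 0.0], [1.0, -1.0, 0.0], [1.0, 1.0, 2.0], [1.0, -1.0, -2.0]]
phi_1 = sparse_eigenvalue(X, 1)
phi_2 = sparse_eigenvalue(X, 2)
phi_3 = sparse_eigenvalue(X, 3)
```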

By using an assumption on the growth rate of $\phi(k)$, we avoid imposing Candes and Tao's
[8] uniform uncertainty principle on the empirical design matrix $E_n[x_i x_i']$. Meinshausen
and Yu [26] argue that the assumption in terms of $\phi(k)$ is less stringent than the uniform
uncertainty principle, since it allows for non-vanishing correlation between the regressors.
Meinshausen and Yu [26] provide a thorough discussion of the behavior of $\phi(n)$ in many
cases of interest. In particular, they show that the condition $\phi(n) \lesssim_P 1$ appears
reasonable in several cases (for example, when the empirical design matrix is block diagonal).
Note that if the intercept is included as a covariate we have $\phi(1) \ge 1$. For the purposes
of a basic overview of results in the next subsection, we employ the assumption

(2.15)  $\phi(n/\log(n \vee p)) \lesssim_P 1,$

which will cover standard Gaussian regressors and some other regressors considered in
Meinshausen and Yu [26] (because $\phi(n/\log(n \vee p)) \le \phi(n)$). Furthermore, in our
general analysis presented in Section 3, we do not impose (2.15) and allow for the sparse
eigenvalue $\phi(n/\log(n \vee p))$ to diverge, which should allow for situations with
regressors having tails thicker than Gaussian.

In order to illustrate our conditions we employ the following canonical examples throughout
the paper.

Example 1 (Isotropic Normal Design). Let us consider estimating the median ($u = 1/2$) of the
following regression model

$y = x'\beta_0 + \varepsilon,$

where the covariate $x_1 = 1$ is the intercept and the covariates $x_{-1} \sim N(0, I)$, and
the errors $\varepsilon$ are independent identically distributed with a smooth probability
density function which is positive at zero and has bounded derivatives. This example satisfies
conditions D.1, D.3, D.4, and D.2 if $\|\beta_0\|_0 \le s = o(n/\log(n \vee p))$. Moreover, the
maximal $k$-sparse eigenvalues for $k \le n$ satisfy

$\phi(k) := \sup_{\|\alpha\| = 1, \|\alpha\|_0 \le k} E_n[(\alpha'x_i)^2] \lesssim_P 1 + \sqrt{\frac{k \log p}{n}}$

by Lemma 14. Thus, this design satisfies our conditions with
$\phi(n/\log(n \vee p)) \sim_P 1$. Moreover, as shown in [8], this design satisfies Candes and
Tao's uniform uncertainty principle.

Example 2 (Correlated Normal Design). We consider the same setup as in Example 1, but instead
we suppose that the covariates are correlated, namely $x_{-1} \sim N(0, \Sigma)$, where
$\Sigma_{ij} = \rho^{|i-j|}$ and $-1 < \rho < 1$ is fixed. This example satisfies conditions
D.1, D.3, D.4, and D.2 if $\|\beta_0\|_0 \le s = o(n/\log(n \vee p))$. The maximal $k$-sparse
eigenvalues for $k \le n$ satisfy

$\phi(k) := \sup_{\|\alpha\| = 1, \|\alpha\|_0 \le k} E_n[(\alpha'x_i)^2] \lesssim_P \frac{1 + |\rho|}{1 - |\rho|}\left(1 + \sqrt{\frac{k \log p}{n}}\right)$

by Lemma 14. Thus, this design satisfies our conditions with
$\phi(n/\log(n \vee p)) \sim_P 1$. However, as mentioned in [26], this design violates Candes
and Tao's uniform uncertainty principle, which requires $|\rho| \to 0$ at a $\log p$ rate.
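
One way to see where a factor of the form $(1 + |\rho|)/(1 - |\rho|)$ can come from is the standard row-sum bound: the largest eigenvalue of a symmetric matrix is at most its maximal absolute row sum, and for $\Sigma_{ij} = \rho^{|i-j|}$ every row sum is below $1 + 2\sum_{k \ge 1}|\rho|^k = (1 + |\rho|)/(1 - |\rho|)$. The sketch below checks this numerically and also draws one covariate vector with this covariance through an AR(1) recursion. It is an illustration only, not the argument of Lemma 14, and all names and parameter values are assumptions.

```python
import random


def toeplitz_cov(p, rho):
    """Covariance matrix with entries Sigma_ij = rho^|i - j|."""
    return [[rho ** abs(i - j) for j in range(p)] for i in range(p)]


def max_abs_row_sum(S):
    """Upper bound on the largest eigenvalue of a symmetric matrix."""
    return max(sum(abs(v) for v in row) for row in S)


def draw_correlated_row(p, rho, rng):
    """One draw from N(0, Sigma) via the stationary AR(1) recursion."""
    z = [rng.gauss(0.0, 1.0) for _ in range(p)]
    x = [z[0]]
    for j in range(1, p):
        x.append(rho * x[-1] + (1.0 - rho * rho) ** 0.5 * z[j])
    return x


rho, p = 0.5, 30
Sigma = toeplitz_cov(p, rho)
bound = (1 + abs(rho)) / (1 - abs(rho))
row_sum = max_abs_row_sum(Sigma)

# A design row as in Example 2: intercept followed by correlated normal covariates.
rng = random.Random(1)
x_row = [1.0] + draw_correlated_row(p - 1, rho, rng)
```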
Finally, it is worth noting that our analysis in Sections 3 and 4, and in Appendix A allows
the  key  parameters  of  the  model,  such  as  the  bounds  on  the  eigenvalues  of  the  design  matrix 
and  on  the  density  function,  to  change  with  the  sample  size.  This  will  explicitly  allow  us 
to  trace  out  the  impact  of  these  parameters  on  the  large  sample  behavior  of  the  penalized 
estimator.  In  particular,  we  will  be  able  to  immediately  see  how  some  basic  changes  in 
the  primitive  conditions  stated  above  affect  the  large  sample  behavior  of  the  penalized 
estimator. 

2.5. Overview of Main Results. Here we discuss our results under the simplest assumptions,
consisting of conditions D.1-D.4 and condition (2.15) on the maximal
$(n/\log(n \vee p))$-sparse eigenvalue. These simplest assumptions allow us to
straightforwardly compare our results to those obtained in the literature, without getting
into nuisance details. We state our results under more general conditions in the subsequent
sections: in Section 3, we present various results on convergence rates and model selection;
in Section 4, we analyze our choice of the penalization parameter.

In order to achieve the most rapid rate of convergence, we need to choose

(2.16)  $\lambda = t\sqrt{n \log(n \vee p)}$

with $t$ growing as slowly as possible with $n$; for concreteness, let $t \propto \log\log n$.

Our first main result is that the $\ell_1$-penalized quantile regression estimator converges
at the rate:

(2.17)  $\|\hat\beta(u) - \beta(u)\| \lesssim_P \frac{\lambda\sqrt{s}}{n} = \sqrt{\frac{s}{n}} \cdot t \cdot \sqrt{\log(n \vee p)},$

provided that the number of non-zero components $s$ satisfies

(2.18)  $\sqrt{\frac{s}{n}} \cdot t \cdot \sqrt{\log(n \vee p)} \to 0.$

We note that the total number of regressors $p$ affects the rate of convergence (2.17) only
through a logarithm in $p$. Hence if $p$ is polynomial in $n$, the rate of convergence is
$\sqrt{s/n} \cdot t \cdot \sqrt{\log(n \vee p)}$, which is very close to the oracle rate
$\sqrt{s/n}$, obtainable when we know the minimal true model. Further, we note that our
resulting restriction (2.18) on the dimension $s$ of the minimal true model is very weak; when
$p$ is polynomial in $n$ and $t \propto \log\log n$, $s$ can be of almost the same order as
$n$, namely $s = o(n/(t^2 \log n))$.
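
The arithmetic behind (2.16)-(2.18) is easy to check numerically. The sketch below computes the penalty level, the resulting rate, and its ratio to the oracle rate for one hypothetical configuration; the values of `n`, `p`, `s` and the choice $t = \log\log n$ (with proportionality constant one) are assumptions made only for illustration.

```python
import math


def penalty_level(n, p, t):
    """lambda = t * sqrt(n * log(n v p)), as in (2.16)."""
    return t * math.sqrt(n * math.log(max(n, p)))


def l1_qr_rate(n, p, s, t):
    """sqrt(s / n) * t * sqrt(log(n v p)), the rate in (2.17)."""
    return math.sqrt(s / n) * t * math.sqrt(math.log(max(n, p)))


def oracle_rate(n, s):
    """sqrt(s / n), the rate attainable when the true support is known."""
    return math.sqrt(s / n)


# Hypothetical configuration with p polynomial in n (here p = n^2).
n, s = 10_000, 20
p = n ** 2
t = math.log(math.log(n))

lam = penalty_level(n, p, t)
rate = l1_qr_rate(n, p, s, t)
# The price paid for not knowing the support is the factor t * sqrt(log(n v p)).
ratio = rate / oracle_rate(n, s)
```

The ratio grows only logarithmically in $p$, which is why very large numbers of zero-impact regressors can be tolerated.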

Our second main result is that the dimension $\|\hat\beta(u)\|_0$ of the model selected by the
$\ell_1$-penalized estimator is of the same stochastic order as the dimension $s$ of the
minimal true model, namely

(2.19)  $\|\hat\beta(u)\|_0 \lesssim_P s.$

Further, if the parameter values of the minimal true model are well separated away from
zero, namely

(2.20)  $\min_{j \in \mathrm{support}(\beta(u))} |\beta_j(u)| > \ell \sqrt{\frac{s}{n}} \cdot t \cdot \sqrt{\log(n \vee p)},$

for some diverging sequence $\ell$ of positive constants, then with probability converging to
one, the model selected by the $\ell_1$-penalized estimator correctly nests the true minimal
model:

(2.21)  $\mathrm{support}(\beta(u)) \subseteq \mathrm{support}(\hat\beta(u)).$

Moreover, we provide conditions under which a hard-thresholding selects the correct support.
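
The distinction between nesting the true model, as in (2.21), and recovering its support exactly after hard-thresholding can be illustrated with a few lines of code. The coefficient vectors and the threshold `tau` below are hypothetical; the paper's conditions on the threshold are not reproduced here.

```python
def support(beta, tol=0.0):
    """Indices of coefficients whose magnitude exceeds tol."""
    return {j for j, b in enumerate(beta) if abs(b) > tol}


def hard_threshold(beta, tau):
    """Set to zero every coefficient with magnitude at most tau."""
    return [b if abs(b) > tau else 0.0 for b in beta]


beta_true = [1.0, 0.0, -2.0, 0.0, 0.0]
beta_hat = [0.9, 0.05, -1.7, 0.0, -0.02]  # hypothetical penalized estimate

# The selected model contains the true support but also two spurious regressors.
nests_true_model = support(beta_true) <= support(beta_hat)
extra_regressors = support(beta_hat) - support(beta_true)

# Thresholding removes the small spurious coefficients in this example.
beta_thresholded = hard_threshold(beta_hat, tau=0.1)
exact_recovery = support(beta_thresholded) == support(beta_true)
```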

Our third main result is that a two-step estimator, which applies standard quantile
regression to the selected model, achieves a similar rate of convergence:


y-\/log("Vp) 


(2.22)  ^-^log(nvp)^0, 

provided the true non-zero coefficients are well-separated from zero in the sense of equation
(2.20).

Finally, our fourth main result is to propose (2.13), a data-driven choice of the
regularization parameter $\lambda$ which has a pivotal finite sample distribution conditional on the
regressors, and to verify that (2.13) satisfies the theoretical restriction (2.16), supporting
its use in practical estimation.

Our  results  for  quantile  regression  parallel  the  results  for  least  squares  by  Meinshausen 
and  Yu  [26]  and  by  Candes  and  Tao  [8].  Our  results  on  the  pivotal  choice  of  the  regularization 
parameter  partly  parallel  the  results  by  Candes  and  Tao  [8],  except  that  our  choice  is  pivotal 
whereas  Candes  and  Tao's  choice  relies  upon  the  knowledge  of  the  standard  deviation  of 
the  regression  disturbances.  The  existence  of  close  parallels  may  seem  surprising,  since,  in 
contrast  to  the  least  squares  problem,  our  problem  is  highly  non-linear  and  non-smooth. 
Nevertheless,  there  is  an  intuition  presented  below,  suggesting  that  we  can  overcome  these 
difficulties. 

While our results for quantile regression parallel results for least squares, our proof
strategy is substantially different, as it has to address non-linearities and non-smoothness. In
order to explain the difference, let us recall, e.g., the proof strategy of Meinshausen and
Yu  [26].  They  first  analyze  the  problem  with  no  disturbances,  recognize  sparseness  of  the 
solution  for  this  zero  noise  problem,  and  then  analyze  a  sequence  of  problems  along  the 
path  interpolating  the  zero-noise  problem  and  the  full-noise  problem.  Along  this  sequence, 
they  bound  the  increments  in  the  number  of  non-zero  components  and  in  the  rates  of 
convergence.  This  approach  does  not  seem  to  work  for  our  problem,  where  the  zero-noise 
problem  does  not  seem  to  have  either  the  required  sparseness  or  the  required  smoothness.  In 
sharp  contrast,  our  approach  directly  focuses  on  the  full-noise  problem,  and  simultaneously 
bounds  the  number  of  non-zero  components  and  convergence  rates.  Thus,  our  approach  may 
be  of  independent  interest  for  other  M-estimation  problems  and  even  for  the  least  squares 
problem. 

Our analysis is perhaps closer in spirit to, but still quite different from, the important
work of van der Geer [37], which derived finite sample bounds on the empirical risk of
$\ell_1$-penalized estimators in generalized linear models (but did not investigate quantile
regression models). The major difference between our proof strategy and that of van der Geer [37]
is that we analyze the sparseness of the solution to the penalized problem and then further
exploit sparseness to control empirical errors in the sample criterion function. As a result, we
derive not only the results on model selection and on sparseness of solutions, which are of
prime interest, but also the results on consistency and rates of convergence under weak
conditions on the number of non-zero components $s$. As mentioned above, our approach
allows $s$ to be of almost the same order as the sample size $n$, and delivers convergence
rates that are close to $\sqrt{s/n}$. In contrast, van der Geer's [37] approach requires $s$ to be
much smaller than $n$, namely $s^2/n \to 0$, and thus does not deliver consistency or rates of
convergence when $s^2/n \to \infty$.

In our proofs we critically rely on two key quantities: the number of non-zero components
$m = \|\hat\beta(u)\|_0$ of the solution $\hat\beta(u)$ and the empirical error in the sample
criterion, $\hat Q_u(\beta) - Q_u(\beta)$ (with $\beta$ ranging over all $m$-dimensional submodels
of the large $p$-dimensional model). In particular, we make use of the following relations:

(1)  lower  m  implies  smaller  empirical  error,  and 

(2)  smaller  empirical  error  can  imply  lower  m. 

Starting with a structural initial upper bound on $m$ (see condition E.3 below) we can use
the two relations to solve for sharp bounds on $m$ and the empirical error, given which we
then can solve for convergence rates.

Let us comment on the intuition behind relations (1) and (2). Relation (1) follows from
an application of the usual entropy-based maximal inequalities, upon realizing that the
entropy of all $m$-dimensional models grows at the rate $m\log p$. In particular, the lower the $m$,
the closer the sample criterion function $\hat Q_u$ is to a locally quadratic function, uniformly
across all $m$-dimensional submodels. Relation (2) follows from the use of the $\ell_1$-penalty,
which tends to favor lower-dimensional solutions when $\hat Q_u$ is close to being quadratic.
Figure 1 provides a visual illustration of this, using a two-dimensional example with a
one-dimensional true minimal submodel; in the example, the true parameter value
$(\beta_1(u), \beta_2(u))$ is $(1,0)$. Figure 1 plots a diamond, centered at the origin, representing
a contour set of the $\ell_1$-penalty function, and a pearl, representing a contour set of the
criterion function $\hat Q_u$. By the dual interpretation (2.9) of our estimation problem, the
penalized estimator looks for a minimal diamond, subject to the diamond having a non-empty
intersection with a fixed pearl. The set of optimal solutions is then given by the intersection
of the minimal diamond with the pearl. Smaller empirical errors shape the pearl into an ellipse
and center it closer to the true parameter value of $(1,0)$ (left panel of Figure 1). Larger
empirical errors shape the pearl like a non-ellipse and can center it far away from the true
parameter value (right panel of Figure 1). Therefore, smaller empirical errors tend to cause
sparse optimal solutions, correctly setting $\hat\beta_2(u) = 0$; larger empirical errors tend to
cause non-sparse optimal solutions, incorrectly setting $\hat\beta_2(u) \neq 0$.

Fig 1. These figures provide a geometric illustration for the discussion given in the text concerning why
$\ell_1$-penalized estimation may be (left panel) or may not be (right panel) successful at selecting the minimal
true model.


3. Analysis and Main Results Under High-Level Conditions. In this section
we prove the main results under general conditions that encompass the simple conditions
D.1-D.4 as a special case.

3.1. The Five Basic Conditions. We will work with the following five basic conditions
E.1-E.5, which are the essential ingredients needed for our asymptotic approximations. In
Appendix A, we verify that conditions E.1-E.5 hold under the simple sufficient conditions
D.1-D.4 stated in Section 2, and we also show that E.1-E.5 arise much more generally. In
particular, in Appendix A we characterize key constants appearing in E.1-E.5 in terms of
the parameters of the model.

E.1. True Model Sparseness. The true parameter value $\beta(u)$ has at most $s < n/\log(n\vee p)$
non-zero components, namely

(3.1)  $\|\beta(u)\|_0 = s < n/\log(n\vee p).$

E.2. Identification in Population. In the population, the true parameter value $\beta(u)$ is the
unique minimizer of the quantile objective function. Moreover, the following minorization
condition holds,

(3.2)  $Q_u(\beta) - Q_u(\beta(u)) \ge q\left(\|\beta-\beta(u)\|^2 \wedge g(\|\beta-\beta(u)\|)\right),$

uniformly in $\beta\in\mathbb{R}^p$, where $g:\mathbb{R}_+\to\mathbb{R}_+$ is a fixed convex function with $g'(0) > 0$, and
$q$ is a sequence of positive numbers that characterizes the strength of identification in the
population.

E.3. Empirical Pre-Sparseness. The number $\hat m = \|\hat\beta(u)\|_0$ of non-zero components of
the solution $\hat\beta(u)$ to the penalized quantile regression problem (2.4) obeys the inequality

(3.3)  $\hat m \le n\wedge p\wedge \dfrac{n^2\phi(\hat m)}{\lambda^2},$

where $\phi(m)$ is the maximal $m$-sparse eigenvalue.

E.4. Empirical Sparseness. For $r = \|\hat\beta(u)-\beta(u)\|$, $\hat m = \|\hat\beta(u)\|_0$ obeys the
following stochastic inequality

(3.4)  $\sqrt{\hat m} \lesssim_p \mu\,\dfrac{n}{\lambda}\,(r\wedge 1) + \sqrt{\hat m}\,\dfrac{\sqrt{n\log(n\vee p)\phi(\hat m)}}{\lambda},$

where $\mu \ge q$ is a sequence of positive constants. The sequence of constants $\mu$ is determined
by the population analog of the empirical sparse eigenvalue $\phi(m_0)$ (cf. Appendix A).

E.5. Sparse Control of Empirical Error. The empirical error that describes the deviation
of the empirical criterion from the population criterion satisfies the following stochastic
inequality

(3.5)  $\left|\hat Q_u(\beta) - Q_u(\beta) - \left(\hat Q_u(\beta(u)) - Q_u(\beta(u))\right)\right| \lesssim_p r\sqrt{\dfrac{(m+s)\log(n\vee p)\phi(m+s)}{n}}$

uniformly over $\{\beta\in\mathbb{R}^p : \|\beta\|_0 \le m\wedge n\wedge p,\ \|\beta-\beta(u)\| \le r\}$, uniformly over $m \le n$, $r > 0$.

Let us briefly comment on each of the conditions. As stated earlier, condition E.1 is a
basic modeling assumption, and condition E.2 is an identification assumption, required to
hold in population. Conditions E.3 and E.4 arise from two characterizations of sparseness of
the solution to the optimization problem (2.4) defining the estimator. Condition E.3 arises
from simple bounds applied to the first characterization. Condition E.4 arises from maximal
inequalities applied to the second characterization. Condition E.5 arises from maximal
inequalities applied to the empirical criterion function. To derive conditions E.4 and E.5,
we crucially exploit the fact that the entropy of all $m$-dimensional submodels of the
$p$-dimensional model is of order $m\log p$, which depends on $p$ only logarithmically. Finally, we
note that conditions E.1-E.5 easily hold under primitive assumptions D.1-D.4, in particular
with $\mu \sim q \sim 1$, but we also permit them to hold more generally. We refer the reader to Section
5 for verification and further analysis of these conditions.

Theorem  1  combines  conditions  E.1-E.5  to  establish  bounds  on  the  rate  of  convergence 
and  sparseness  of  the  estimator  (2.4). 

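Theorem 1. Assume that conditions E.1-E.5 hold. Let $t \to_p \infty$ be a sequence of positive
numbers, possibly data-dependent, define

(3.6)  $m_0 = p\wedge\left(\dfrac{n}{\log(n\vee p)}\dfrac{q^2}{\mu^2}\right)$  and set  $\lambda = t\sqrt{n\log(n\vee p)\phi(m_0+s)}\,\dfrac{\mu}{q}.$

Then we have that

(3.7)  $\|\hat\beta(u)-\beta(u)\| \lesssim_p \dfrac{\lambda\sqrt{s}}{qn} = t\sqrt{\dfrac{s\log(n\vee p)\phi(m_0+s)}{n}}\,\dfrac{\mu}{q^2},$

provided that $\lambda\sqrt{s}/(qn) \to_p 0$, and

(3.8)  $\|\hat\beta(u)\|_0 \lesssim_p s\left(\dfrac{\mu}{q}\right)^2.$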


This is the main result of the paper: it derives the rate of convergence of the $\ell_1$-penalized
quantile regression estimator and a stochastic bound on the dimension of the selected model.
Our results parallel the results of Meinshausen and Yu [26] obtained for the $\ell_1$-penalized
mean regression. We refer the reader to Section 2 for a detailed discussion of this and other
main results of this section under simplified conditions. Here we only note that the rate of
convergence generally depends on the number of significant regressors $s$, the logarithm of
the number of regressors $p$, the strength of identification $q$, the empirical sparse eigenvalue
$\phi(m_0)$, and the constant $\mu$ determined by the population sparse eigenvalue. The bound on
the dimension also depends on the sequence of constants $s$, $q$, and $\mu$.
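
As a numerical companion to Theorem 1, the sketch below evaluates the quantities in (3.6)-(3.8) for user-supplied inputs. The function name `theorem1_quantities` and the argument `phi`, which stands for an assumed value of the sparse eigenvalue $\phi(m_0+s)$, are hypothetical; in practice $\phi$ must be bounded or estimated separately.

```python
import math

def theorem1_quantities(n, p, s, t, phi, q=1.0, mu=1.0):
    """Evaluate m0 and lambda from (3.6), the rate in (3.7), and the dimension bound in (3.8).

    phi is an assumed value for the sparse eigenvalue phi(m0 + s).
    """
    log_np = math.log(max(n, p))
    m0 = min(p, (n / log_np) * (q / mu) ** 2)
    lam = t * math.sqrt(n * log_np * phi) * mu / q
    rate = lam * math.sqrt(s) / (q * n)      # left form of the rate in (3.7)
    dim_bound = s * (mu / q) ** 2            # order of the bound in (3.8)
    return {"m0": m0, "lambda": lam, "rate": rate, "dim_bound": dim_bound}

# Illustrative (made-up) inputs.
out = theorem1_quantities(n=2000, p=10_000, s=10, t=2.0, phi=1.5)
print(out)
```

The two expressions for the rate in (3.7) agree by direct substitution of $\lambda$, which the returned `rate` makes easy to check for any inputs.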

It is also helpful to state the main result separately under the simple set of conditions
D.1-D.4, where $q \sim \mu \sim 1$.

Corollary 1 (A Leading Case). Conditions D.1-D.4 imply conditions E.1-E.5 with
$q \sim \mu \sim 1$. Therefore, under D.1-D.4, $m_0 = p\wedge(n/\log(n\vee p))$, so setting

$\lambda = t\sqrt{n\log(n\vee p)\phi(m_0)}$  and if  $t\sqrt{\dfrac{s\log(n\vee p)\phi(m_0)}{n}} \to 0,$

we have that

$\|\hat\beta(u)-\beta(u)\| \lesssim_p t\sqrt{\dfrac{s\log(n\vee p)\phi(m_0)}{n}}$

and

$\|\hat\beta(u)\|_0 \lesssim_p s.$

If in addition $\phi(m_0) \lesssim_p 1$, then we obtain the rate result listed in equation (2.17).

This corollary follows from lemmas stated in Appendix A, where we verify that conditions
D.1-D.4 imply conditions E.1-E.5. Moreover, we use the fact that $\phi(m_0+s) \le \phi(2m_0)$ if
$s\log(n\vee p) \le n$ for $m_0 = p\wedge(n/\log(n\vee p))$, and that $\phi(2m_0) \le 2\phi(m_0)$ by Lemma 11.

It  is  useful  to  revisit  our  concrete  examples. 

Example 3 (Isotropic Normal Design, continued). In the isotropic normal design
considered earlier, recall that we have that $\phi(k) \lesssim_p 1 + \sqrt{(k/n)\log p}$. If
$\lambda/\sqrt{n\log(n\vee p)} \to \infty$, by Theorem 1 we have $m_0 \le n/\log(n\vee p)$, and, since we assume
$s \le n/\log(n\vee p)$, by Lemma 11 we have $\phi(m_0+s) \lesssim_p 1$. Also, we verify in Appendix A that
$q \sim \mu \sim 1$. Thus, the rate result listed in equation (2.17) applies to this example.

Example 4 (Correlated Normal Design, continued). In the correlated normal design
considered earlier, we have that $\phi(k) \lesssim_p \frac{1+|\rho|}{1-|\rho|}\left(1 + \sqrt{(k/n)\log p}\right)$. If
$\lambda/\sqrt{n\log(n\vee p)} \to \infty$, by Theorem 1 we have $m_0 \le n/\log(n\vee p)$ and, since we assume
$s \le n/\log(n\vee p)$, by Lemma 11 we have $\phi(m_0+s) \lesssim_p \frac{1+|\rho|}{1-|\rho|} \lesssim_p 1$. Also, we verify in
Appendix A that $q \sim \mu \sim 1$. Thus, the rate result listed in equation (2.17) applies to this
example too.

Proof. (Theorem 1) Let

$r := \|\hat\beta(u)-\beta(u)\|$  and  $\hat m := \|\hat\beta(u)\|_0.$

The proof successively refines upper bounds on $\hat m$ and $r$. We divide the proof into four
steps. The first step provides an initial bound on $\hat m$, the second step obtains preliminary
inequalities, the third step verifies consistency, and the fourth step establishes the rate
result.

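Step 1. We start by proving that $\hat m \le m_0$ if $t \ge \sqrt{2}$. Since $t \to_p \infty$, $\hat m \le m_0$ will occur
with probability converging to one. By condition E.3 we have

$\hat m \le \bar m = \max\left\{m : m \le n\wedge p\wedge \dfrac{n^2\phi(m)}{\lambda^2}\right\}.$

If $m_0 = p$ we have directly that $\hat m \le m_0$. Next consider the case $m_0 = \dfrac{n}{\log(n\vee p)}\dfrac{q^2}{\mu^2}$.
Suppose that $\bar m > m_0$ when $t \ge \sqrt{2}$. Therefore we have $\bar m = m_0\ell$ for some $\ell > 1$ (since
$\bar m \le n\wedge p$ is finite). By definition $\bar m$ satisfies the inequality

(3.9)  $\bar m \le n^2\,\dfrac{\phi(\bar m)}{\lambda^2}.$

Since $\phi(m_0) \le \phi(m_0+s)$ we have $\lambda \ge t\sqrt{n\log(n\vee p)\phi(m_0)}\,(\mu/q)$. Inserting this bound on
$\lambda$, the value of $m_0$, and $\bar m = m_0\ell$ in (3.9), and then using Lemma 11 and $t \ge \sqrt{2}$ we obtain

$m_0\ell \le \dfrac{n^2}{t^2 n\log(n\vee p)}\,\dfrac{\phi(m_0\ell)}{\phi(m_0)}\,\dfrac{q^2}{\mu^2} \le \dfrac{n}{t^2\log(n\vee p)}\,\dfrac{q^2}{\mu^2}\,\lceil\ell\rceil < m_0\,\dfrac{2\ell}{t^2} \le m_0\ell,$

which is a contradiction.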
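
The initial bound of Step 1 can be checked numerically. The sketch below assumes $q = \mu = 1$ and uses the stand-in sparse eigenvalue $\phi(k) = 1 + \sqrt{(k/n)\log p}$, which is only the upper-bound shape quoted for the isotropic normal design, not an actual eigenvalue computation; the function names are hypothetical. It computes $\bar m$ by brute force and compares it with $m_0$.

```python
import math

def phi_isotropic(k, n, p):
    """Assumed stand-in for the sparse eigenvalue: 1 + sqrt((k/n) * log p)."""
    return 1.0 + math.sqrt(k * math.log(p) / n)

def presparseness_bound(n, p, s, t):
    """Return (m_bar, m0) with q = mu = 1, where
    m_bar = max{m <= min(n, p) : m <= n^2 * phi(m) / lambda^2}."""
    log_np = math.log(max(n, p))
    m0 = min(p, n / log_np)
    lam = t * math.sqrt(n * log_np * phi_isotropic(m0 + s, n, p))
    # m = 0 always satisfies the inequality, so the maximum is well defined.
    m_bar = max(m for m in range(0, min(n, p) + 1)
                if m <= n ** 2 * phi_isotropic(m, n, p) / lam ** 2)
    return m_bar, m0

m_bar, m0 = presparseness_bound(n=1000, p=5000, s=5, t=2.0)
print(f"m_bar = {m_bar}, m0 = {m0:.1f}")
```

Under these assumptions the brute-force $\bar m$ never exceeds $m_0$ once $t \ge \sqrt{2}$, in line with the contradiction argument above.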

Step 2. In this step we obtain some preliminary inequalities.

By condition E.1, the support of $\beta(u)$,

$T_u := \mathrm{support}(\beta(u)) := \{j\in\{1,\ldots,p\} : |\beta_j(u)| > 0\},$

has exactly $s$ elements, that is, $|T_u| = s$. Let $\hat\beta_{T_u}(u)$ denote a vector whose $T_u$ components
agree with the $T_u$ components of $\hat\beta(u)$, and whose remaining components are equal to zero.
By definition of $\hat\beta(u)$ and since $\|\hat\beta_{T_u}(u)\|_1 \le \|\hat\beta(u)\|_1$ we have that

$\hat Q_u(\hat\beta(u)) - \hat Q_u(\beta(u)) \le \dfrac{\lambda}{n}\left(\|\beta(u)\|_1 - \|\hat\beta(u)\|_1\right) \le \dfrac{\lambda}{n}\left(\|\beta(u)\|_1 - \|\hat\beta_{T_u}(u)\|_1\right).$

Using that

$\left|\|\beta(u)\|_1 - \|\hat\beta_{T_u}(u)\|_1\right| \le \|\beta(u) - \hat\beta_{T_u}(u)\|_1 \le \sqrt{s}\,\|\hat\beta_{T_u}(u) - \beta(u)\| \le \sqrt{s}\,r$

we obtain that

$\hat Q_u(\hat\beta(u)) - \hat Q_u(\beta(u)) \le \dfrac{\lambda}{n}\sqrt{s}\,r.$

Applying condition E.5 to control the difference between the sample and population criterion
functions, we further get that

$Q_u(\hat\beta(u)) - Q_u(\beta(u)) \lesssim_p \dfrac{\lambda}{n}\sqrt{s}\,r + r\sqrt{\dfrac{(\hat m+s)\log(n\vee p)\phi(\hat m+s)}{n}}.$

Invoking the identification condition E.2 and the definition of $r$, we obtain

(3.10)  $q\left(r^2\wedge g(r)\right) \lesssim_p \dfrac{\lambda}{n}\sqrt{s}\,r + r\sqrt{\dfrac{(\hat m+s)\log(n\vee p)\phi(\hat m+s)}{n}}.$


Step 3. In this step we show consistency, namely $r = o_p(1)$. By Step 1 we have $\hat m \le m_0$
with probability converging to one.

The construction (3.6) of $\lambda$, $t \to_p \infty$, and the condition $\lambda\sqrt{s}/(qn) \to_p 0$ assumed in the
theorem imply

(i) $\dfrac{\lambda\sqrt{s}}{n} = o_p(q)$,   (ii) $\sqrt{\dfrac{s\log(n\vee p)\phi(m_0+s)}{n}} = o_p(q)$,   (iii) $\dfrac{\mu}{q}\,\dfrac{\sqrt{n\log(n\vee p)\phi(m_0+s)}}{\lambda} \to_p 0.$

Condition (iii), $\mu \ge q$, and the empirical sparseness condition E.4, stated in equation (3.4),
imply that

(3.11)  $\sqrt{\hat m} \lesssim_p \mu\,(r\wedge 1)\,n/\lambda + \sqrt{\hat m}\,o_p(1),$

which implies the following second bound on $\hat m$:

(3.12)  $\sqrt{\hat m} \lesssim_p \mu\,n/\lambda.$

Using (3.12) and $\hat m > s$ in equation (3.10) gives

(3.13)  $1\{\hat m > s\}\,q\left(r^2\wedge g(r)\right) \lesssim_p r\,\dfrac{\lambda}{n}\sqrt{s} + r\,\dfrac{\sqrt{n\log(n\vee p)\phi(m_0+s)}\,\mu}{\lambda} = r\,o_p(q),$

where the last equality follows by conditions (i) and (iii). On the other hand, using (3.12)
and $\hat m \le s$ in equation (3.10) gives

(3.14)  $1\{\hat m \le s\}\,q\left(r^2\wedge g(r)\right) \lesssim_p r\,\dfrac{\lambda}{n}\sqrt{s} + r\sqrt{\dfrac{s\log(n\vee p)\phi(m_0+s)}{n}} = r\,o_p(q),$

where the last equality follows by conditions (i) and (ii) and $\mu \ge q$. Conclude from (3.13)
and (3.14) that

(3.15)  $q\left(r^2\wedge g(r)\right) = r\,o_p(q).$

Next we show that (3.15) implies $r = o_p(1)$. Dividing both sides of (3.15) by $q$ and by
$r$ we have $1\{r > 0\}[r\wedge(g(r)/r)] \lesssim_p 1\{r > 0\}\,o_p(1)$. By condition E.2, $g$ is a fixed convex
function with $g'(0) > 0$, so that $g(r) \ge g'(0)\,r$. Thus, $1\{r > 0\}[r\wedge g'(0)] = 1\{r > 0\}\,o_p(1)$,
that is, $r = o_p(1)$.

Step 4. This step derives the rate of convergence.

Using that $r = o_p(1)$ we improve the bound (3.11) on $\hat m$ to the following third bound:

(3.16)  $\sqrt{\hat m} \lesssim_p \mu\,r\,n/\lambda.$

Plugging (3.16) into (3.10) and using the relation $r^2 = O_p(g(r))$ under $r = o_p(1)$ gives us

(3.17)  $q\,r^2 \lesssim_p r\,\dfrac{\lambda\sqrt{s}}{n} + o_p(q)\,r^2$  or equivalently  $r \lesssim_p \dfrac{\lambda\sqrt{s}}{qn}.$

Finally, inserting (3.17) into (3.16), we obtain $\sqrt{\hat m} \lesssim_p \sqrt{s}\,(\mu/q)$, which verifies the final
bound (3.8) on $\hat m$. □

3.2. Model Selection Properties. Next we turn to the model selection properties of the
estimator.

Theorem 2. If the conditions of Theorem 1 hold, and if the non-zero components of $\beta(u)$
are separated away from zero, namely

(3.18)  $\min_{j\in\mathrm{support}(\beta(u))}|\beta_j(u)| > \ell\,t\sqrt{\dfrac{s\log(n\vee p)\phi(m_0+s)}{n}}\,\dfrac{\mu}{q^2},$

for some diverging sequence $\ell$ of positive constants, $\ell \to \infty$, then with probability approaching
one

(3.19)  $\mathrm{support}(\beta(u)) \subseteq \mathrm{support}(\hat\beta(u)).$

Moreover, the hard-thresholded estimator $\bar\beta(u)$, defined by

$\bar\beta_j(u) = \hat\beta_j(u)\,1\left\{|\hat\beta_j(u)| > \ell'\,t\sqrt{\dfrac{s\log(n\vee p)\phi(m_0+s)}{n}}\,\dfrac{\mu}{q^2}\right\},$

where $\ell' \to \infty$ and $\ell'/\ell \to 0$, satisfies with probability converging to one,

$\mathrm{support}(\bar\beta(u)) = \mathrm{support}(\beta(u)).$

Theorem 2 derives some model selection properties of the $\ell_1$-penalized quantile
regression. These results parallel analogous results obtained by Meinshausen and Yu [26] for the
$\ell_1$-penalized mean regression. The first result says that in order for the support of the
estimator to include the support of the true model, non-zero coefficients need to be well-separated
from zero, which is a stronger condition than what we required for consistency. The
inclusion of the true support is in general one-sided; the support of the estimator can include
some unnecessary components whose true coefficients equal zero. The second result
describes the performance of the $\ell_1$-penalized estimator with an additional hard thresholding,
which does eliminate inclusions of such unnecessary components. However, the value of the
right threshold explicitly depends on the parameter values characterizing the separation of
non-zero coefficients from zero.
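
The hard-thresholding step itself is mechanically simple. The following minimal sketch applies it to a made-up coefficient vector; the helper names, the numbers, and the threshold value are illustrative assumptions, and in the theorem the threshold is the quantity $\ell' t\sqrt{s\log(n\vee p)\phi(m_0+s)/n}\,\mu/q^2$ rather than a fixed constant.

```python
def hard_threshold(beta_hat, threshold):
    """Keep components whose absolute value exceeds the threshold; set the rest to zero."""
    return [b if abs(b) > threshold else 0.0 for b in beta_hat]

def support(beta):
    """Indices of the non-zero components."""
    return {j for j, b in enumerate(beta) if b != 0.0}

# Made-up penalized estimate: true support is {0, 4}, plus a few small spurious entries.
beta_hat = [1.02, 0.0, -0.03, 0.0, 0.51, 0.04]
beta_bar = hard_threshold(beta_hat, threshold=0.1)

print(support(beta_hat))   # includes the spurious indices 2 and 5
print(support(beta_bar))   # only the large components survive
```

The example mirrors the one-sided inclusion described above: the unthresholded support contains the true support plus extras, and thresholding removes the extras only if the threshold sits between the noise level and the smallest true coefficient.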

Proof. (Theorem 2) The result on inclusion of the support stated in equation (3.19)
follows from the separation assumption (3.18) and the inequality
$\|\hat\beta(u)-\beta(u)\|_\infty \le \|\hat\beta(u)-\beta(u)\|$. Indeed, by Theorem 1 we have with probability
going to one,

(3.20)  $\|\hat\beta(u)-\beta(u)\|_\infty \le \|\hat\beta(u)-\beta(u)\| < \min_{j\in\mathrm{support}(\beta(u))}|\beta_j(u)|.$

The last inequality follows from the rate result of Theorem 1 and from the separation
assumption (3.18). Next, the converse of the inclusion event (3.19) implies that
$\|\hat\beta(u)-\beta(u)\|_\infty \ge \min_{j\in\mathrm{support}(\beta(u))}|\beta_j(u)|$. Since the latter can occur only with
probability approaching zero, we conclude that the event (3.19) occurs with probability
converging to one.

Consider the hard-thresholded estimator next. Let $r_n = t\sqrt{(s/n)\log(n\vee p)\phi(m_0+s)}\,\mu/q^2$.
To establish the inclusion note that by Theorem 1 and the separation assumption (3.18)

$\min_{j\in\mathrm{support}(\beta(u))}|\hat\beta_j(u)| \ge \min_{j\in\mathrm{support}(\beta(u))}\left\{|\beta_j(u)| - |\hat\beta_j(u)-\beta_j(u)|\right\} \gtrsim_p \ell r_n - r_n,$

so that $\min_{j\in\mathrm{support}(\beta(u))}|\hat\beta_j(u)| > \ell' r_n$ with probability going to one by $\ell' \to \infty$ and
$\ell'/\ell \to 0$. Therefore, $\mathrm{support}(\beta(u)) \subseteq \mathrm{support}(\bar\beta(u))$ with probability going to one. To
establish the opposite inclusion, consider the quantity

$e_n = \max_{j\notin\mathrm{support}(\beta(u))}|\hat\beta_j(u)|.$

By Theorem 1, $e_n \lesssim_p r_n$, so that $e_n < \ell' r_n$ with probability going to one by $\ell' \to \infty$. Since
by the hard-threshold rule all components smaller than $\ell' r_n$ are excluded from the support
of $\bar\beta(u)$, we have that $\mathrm{support}(\bar\beta(u)) \subseteq \mathrm{support}(\beta(u))$ with probability going to one. □

3.3. Two-step estimator. Next we consider the following two-step estimator that applies
the ordinary quantile regression to the selected model. Let $\hat T$ be a model, that is, a subset of
$\{1,\ldots,p\}$, selected by a data-dependent procedure. We define the two-step estimator $\hat\beta^{\hat T}(u)$
as a solution of the following optimization problem:

(3.21)  $\hat\beta^{\hat T}(u) \in \arg\min_{\beta\in\mathbb{R}^p:\ \beta_j = 0,\ j\notin\hat T}\ \hat Q_u(\beta).$

In this problem we constrain the components of the parameter vector $\beta$ that were not
selected to be zero; or, equivalently, we remove the regressors that were not selected from
further estimation. Moreover, we no longer use $\ell_1$-penalization.

Theorem 3. Suppose that conditions E.1, E.2, and E.5 hold. Let $\hat T$ be any selected
model that contains the true model $T_u$ with probability converging to one, and whose
dimension $|\hat T|$ is of stochastic order $s$. Then

$\|\hat\beta^{\hat T}(u)-\beta(u)\| \lesssim_p \sqrt{\dfrac{s\log(n\vee p)\phi(s)}{n}}\,\dfrac{1}{q},$

provided the right side converges to zero in probability.

Under the conditions of the theorem, we see that the rate of convergence of the two-step estimator
is generally faster than the rate of the one-step penalized estimator, unless $\phi(n) \sim_p \phi(s)$,
in which case the rate is the same. It is also helpful to note that when $q \sim 1$ and $\phi(s) \lesssim_p 1$,

$\|\hat\beta^{\hat T}(u)-\beta(u)\| \lesssim_p \sqrt{\dfrac{s}{n}\log(n\vee p)}.$


Proof. (Theorem 3) Let $r = \|\hat\beta^{\hat T}(u)-\beta(u)\|$. By definition of $\hat\beta^{\hat T}(u)$ and by $T_u \subseteq \hat T$
with probability approaching one, we have that with probability approaching one

$\hat Q_u(\hat\beta^{\hat T}(u)) - \hat Q_u(\beta(u)) \le 0.$

First note that since $|\hat T| \lesssim_p s$, by Lemma 11 we have that $\phi(|\hat T|+s) \lesssim_p \phi(s)$. Applying
condition E.5 to control the empirical error in the objective function, we get that

$Q_u(\hat\beta^{\hat T}(u)) - Q_u(\beta(u)) \lesssim_p r\sqrt{\dfrac{(|\hat T|+s)\log(n\vee p)\phi(|\hat T|+s)}{n}} \lesssim_p r\sqrt{\dfrac{s\log(n\vee p)\phi(s)}{n}}.$

Invoking the identification condition E.2 we obtain that

(3.22)  $q\left(r^2\wedge g(r)\right) \lesssim_p r\sqrt{\dfrac{s\log(n\vee p)\phi(s)}{n}}.$

Since we assumed that $\sqrt{s\log(n\vee p)\phi(s)/n} = o_p(q)$, we conclude that $q(r^2\wedge g(r)) \lesssim_p r\,o_p(q)$.
As in the proof of Theorem 1, this implies that $r = o_p(1)$, and that $r^2 = O_p(g(r))$. Therefore
we can refine the bound (3.22) to

$q\,r^2 \lesssim_p r\sqrt{\dfrac{s\log(n\vee p)\phi(s)}{n}}$  or  $r \lesssim_p \sqrt{\dfrac{s\log(n\vee p)\phi(s)}{n}}\,\dfrac{1}{q},$

proving the result. □

4. Analysis of the Pivotal Choice of the Penalization Parameter. In this section
we show that under some conditions the pivotal choice for the penalization parameter $\lambda$
proposed in Section 2.3 satisfies the theoretical requirements needed to achieve the rates of
convergence stated in Theorem 1.

Recall that the true rank scores can be represented almost surely as

$a_i^*(u) = u - 1\{u_i \le u\}$,  for $i = 1,\ldots,n$,

where $u_1,\ldots,u_n$ are i.i.d. uniform (0,1) random variables, independently distributed from
the regressors $x_1,\ldots,x_n$. Thus, we have

(4.1)  $\Lambda_n = n\left\|\mathbb{E}_n[x_i(u - 1\{u_i \le u\})]\right\|_\infty,$

which has a known distribution conditional on $X = (x_1,\ldots,x_n)$.
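
Because the conditional distribution of $\Lambda_n$ depends only on the regressors and on i.i.d. uniform draws, its quantiles can be approximated by simulation. The sketch below is one way this might be done in plain Python; the function names, the number of draws, and the toy design are assumptions for illustration, and the Monte Carlo quantile only approximates the exact conditional quantile.

```python
import math
import random

def simulate_lambda_n(X, u, rng):
    """One draw of Lambda_n = max_j |sum_i x_ij * (u - 1{u_i <= u})| given the design X."""
    n, p = len(X), len(X[0])
    scores = [u - (1.0 if rng.random() <= u else 0.0) for _ in range(n)]
    return max(abs(sum(X[i][j] * scores[i] for i in range(n))) for j in range(p))

def pivotal_penalty(X, u, alpha, n_draws=1000, seed=0):
    """Monte Carlo estimate of the (1 - alpha)-quantile of Lambda_n conditional on X."""
    rng = random.Random(seed)
    draws = sorted(simulate_lambda_n(X, u, rng) for _ in range(n_draws))
    # Smallest order statistic whose empirical CDF value is at least 1 - alpha.
    k = max(0, min(n_draws - 1, math.ceil((1 - alpha) * n_draws) - 1))
    return draws[k]

# Hypothetical design: an intercept column plus standard normal regressors.
design_rng = random.Random(1)
n, p = 50, 10
X = [[1.0] + [design_rng.gauss(0.0, 1.0) for _ in range(p - 1)] for _ in range(n)]

lam_05 = pivotal_penalty(X, u=0.5, alpha=0.05)
lam_50 = pivotal_penalty(X, u=0.5, alpha=0.50)
print(lam_05, lam_50)
```

No knowledge of the response distribution is needed at any point, which is the sense in which the choice is pivotal; a smaller `alpha` yields a larger, more conservative penalty.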

Theorem 4. Let the regularization parameter $\lambda(X)$ be defined as

(4.2)  $\lambda(X) = \inf\{\lambda : P(\Lambda_n \le \lambda \mid X) \ge 1-\alpha_n\},\qquad \alpha_n = \dfrac{1}{(n\vee p)^{\bar t^2}},$

for some sequence $\bar t \to \infty$. Assume that there exists a sequence $c_{n,p}$ such that uniformly in
$1 \le j \le p$

(4.3)  $\max_{1\le i\le n}|x_{ij}| \le c_{n,p}\left(\sum_{i=1}^n x_{ij}^2\right)^{1/2}$  and  $c_{n,p}\cdot\bar t\cdot\sqrt{\log(n\vee p)} \to 0.$

Moreover, assume $q \sim \mu$, $\phi(1) \sim_p \phi(n/\log(n\vee p))$, and that $\bar t\sqrt{s\log(n\vee p)\phi(1)/n}/q \to_p 0$.
Then $\lambda = \lambda(X)$ satisfies the assumptions on the regularization parameter assumed in
Theorem 1, namely there exists a sequence $t \to_p \infty$ such that

(4.4)  $\lambda = t\sqrt{n\log(n\vee p)\phi(m_0+s)}\,\dfrac{\mu}{q}$  and  $\dfrac{\lambda\sqrt{s}}{qn} \to_p 0,$

where $m_0 = p\wedge\left(\dfrac{n}{\log(n\vee p)}\dfrac{q^2}{\mu^2}\right)$, and $t \sim_p \bar t$.

Proof. (Theorem 4) We will use the following inequalities of Stout [34], Theorem 5.2.2:
Let $\{X_i, i \ge 1\}$ denote a sequence of independent random variables with zero mean and
finite variances, and let $S_n = \sum_{i=1}^n X_i$ and $s_n^2 = \sum_{i=1}^n E[X_i^2]$ for all $n \ge 1$. Let $|X_i| \le c\,s_n$
almost surely for each $1 \le i \le n$ and $n \ge 1$. Suppose $\varepsilon > 0$ and $\gamma > 0$. Then for each $n \ge 1$,
the inequality $\varepsilon c \le 1$ implies that

(4.5)  $P(S_n/s_n > \varepsilon) < \exp\left(-(\varepsilon^2/2)(1-\varepsilon c/2)\right),$

and there exist constants $\varepsilon(\gamma)$ and $\pi(\gamma)$ such that if $\varepsilon \ge \varepsilon(\gamma)$ and $\varepsilon c \le \pi(\gamma)$, then

(4.6)  $P(S_n/s_n > \varepsilon) > \exp\left(-(\varepsilon^2/2)(1+\gamma)\right).$


We need to establish upper and lower bounds on the value of $\lambda$. We first establish an
upper bound. Let $v_j^2 = \sum_{i=1}^n x_{ij}^2$, and note that $\phi(1) = \sup_{j\le p} v_j^2/n$. Next observe that
$\mathrm{Var}\left(\sum_{i=1}^n x_{ij}a_i^*(u) \mid X\right) = u(1-u)v_j^2$. Note that by (4.3) we have
$\sup_{1\le i\le n}|x_{ij}a_i^*(u)| \le c_{n,p}v_j/\sqrt{u(1-u)}$, $j = 1,\ldots,p$. Moreover, for $n$ large enough,
condition (4.3) also implies that

(4.7)  $2c_{n,p}(\bar t+1)\sqrt{\log(n\vee p)}/\sqrt{u(1-u)} \le 1/2.$

Under (4.7), we can apply (4.5) with $\varepsilon = 2(\bar t+1)\sqrt{\log(n\vee p)}$ and $c = c_{n,p}/\sqrt{u(1-u)}$ to
obtain that for every $j = 1,\ldots,p$
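Therefore, since $\sqrt{n\phi(1)} \ge v_j$ we have

(4.9)  $P\left(\dfrac{\left|\sum_{i=1}^n x_{ij}a_i^*(u)\right|}{\sqrt{u(1-u)n\phi(1)}} > \varepsilon \,\Big|\, X\right) \le \exp\left(-(\bar t^2+1)\log(n\vee p)\right) = \dfrac{1}{(n\vee p)^{\bar t^2+1}}.$

Next note that using (4.9) we have

(4.10)  $P\left(\Lambda_n > \sqrt{u(1-u)n\phi(1)}\,\varepsilon \mid X\right) \le \sum_{j=1}^p P\left(\left|\sum_{i=1}^n x_{ij}a_i^*(u)\right| > \sqrt{u(1-u)n\phi(1)}\,\varepsilon \,\Big|\, X\right)$
        $\le p\,\max_{j\le p} P\left(\left|\sum_{i=1}^n x_{ij}a_i^*(u)\right| > \sqrt{u(1-u)n\phi(1)}\,\varepsilon \,\Big|\, X\right) \le \dfrac{p}{(n\vee p)^{\bar t^2+1}} \le \dfrac{1}{(n\vee p)^{\bar t^2}}.$

Since $P(\Lambda_n > \lambda \mid X)$ is decreasing in $\lambda$, we conclude that

(4.11)  $\lambda \le \sqrt{u(1-u)n\phi(1)}\,\varepsilon \le 2(\bar t+1)\sqrt{n\log(n\vee p)\phi(1)}.$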
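
The exponent arithmetic behind the upper bound can be checked numerically. The sketch below, with hypothetical function names and made-up inputs, evaluates the right-hand side of (4.5) at $\varepsilon = 2(\bar t+1)\sqrt{\log(n\vee p)}$ and compares it with $(n\vee p)^{-(\bar t^2+1)}$. It relies only on the fact that $\varepsilon c \le 1/2$ gives $1 - \varepsilon c/2 \ge 3/4$, and that $\tfrac{3}{2}(\bar t+1)^2 \ge \bar t^2+1$.

```python
import math

def upper_tail_bound(eps, c):
    """Right-hand side of (4.5): exp(-(eps^2 / 2) * (1 - eps * c / 2)), valid when eps * c <= 1."""
    if eps * c > 1:
        raise ValueError("(4.5) requires eps * c <= 1")
    return math.exp(-(eps ** 2 / 2) * (1 - eps * c / 2))

def compare_with_target(n, p, t_bar, c):
    """Return (bound from (4.5), target (n v p)^-(t_bar^2 + 1)) at eps = 2 (t_bar + 1) sqrt(log(n v p))."""
    log_np = math.log(max(n, p))
    eps = 2 * (t_bar + 1) * math.sqrt(log_np)
    if eps * c > 0.5:
        raise ValueError("condition (4.7) fails for these inputs")
    bound = upper_tail_bound(eps, c)
    target = max(n, p) ** (-(t_bar ** 2 + 1))
    return bound, target

# Made-up inputs; c plays the role of c_{n,p} / sqrt(u (1 - u)).
bound, target = compare_with_target(n=1000, p=5000, t_bar=1.5, c=0.01)
print(bound <= target)
```

Summing the per-coordinate bound over the $p$ coordinates then loses one power of $(n\vee p)$, which is why the exponent $\bar t^2+1$ is the natural target.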


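Next we turn to establishing the lower bound. Let $j_n \in \{1,\ldots,p\}$ denote an index such
that $v_{j_n} = \sqrt{n\phi(1)}$. By definition of $\Lambda_n$ we have

$\dfrac{1}{(n\vee p)^{\bar t^2}} \ge P\left(\max_{j\le p}\left|\sum_{i=1}^n x_{ij}a_i^*(u)\right| > \lambda \,\Big|\, X\right) \ge P\left(\left|\sum_{i=1}^n x_{ij_n}a_i^*(u)\right| > \lambda \,\Big|\, X\right).$

Fix $\gamma > 0$ (which implicitly fixes $\varepsilon(\gamma)$ and $\pi(\gamma)$), and set $\varepsilon = \bar t\sqrt{2\log(n\vee p)/(1+\gamma)}$,
$c = c_{n,p}/\sqrt{u(1-u)}$. Since $\bar t$ diverges, and, by (4.3), we have $\varepsilon c = o(1)$, for $n$ large enough we
have $\varepsilon \ge \varepsilon(\gamma)$ and $\varepsilon c \le \pi(\gamma)$. Therefore we can apply (4.6) to obtain

$P\left(\dfrac{\left|\sum_{i=1}^n x_{ij_n}a_i^*(u)\right|}{\sqrt{u(1-u)n\phi(1)}} > \varepsilon \,\Big|\, X\right) > \exp\left(-(\varepsilon^2/2)(1+\gamma)\right) = \exp\left(-\bar t^2\log(n\vee p)\right) = \dfrac{1}{(n\vee p)^{\bar t^2}}.$

Since $P(\Lambda_n > \lambda \mid X)$ is decreasing in $\lambda$, it follows that

(4.12)  $\lambda \ge \varepsilon\sqrt{u(1-u)n\phi(1)} = \bar t\sqrt{2u(1-u)n\log(n\vee p)\phi(1)/(1+\gamma)}.$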


Thus, taking into account that $\mu \sim q$, we have established $\lambda \sim_p \bar t\sqrt{n\log(n\vee p)\phi(m_0+s)}\,\frac{\mu}{q}$.
In order to verify (4.4) define $t = \lambda/[\sqrt{n\log(n\vee p)\phi(m_0+s)}\,(\mu/q)]$. By construction we
have that $t \sim_p \bar t \to \infty$. Thus, the first result of (4.4) follows, and the second result of (4.4)
follows from the assumption that $\bar t\sqrt{s\log(n\vee p)\phi(1)/n}/q \to_p 0$. □

For concreteness, we now verify the conditions of Theorem 4 in our examples.

Example  5  (Isotropic  Normal  Design,  continued).  Let  x.j  denote  the  n-vector  associ- 
ated with  the  jth  covariate,  where  x.i  is  a  column  of  ones  representing  the  intercept.  Next 
we  use  standard  Gaussian  concentration  bounds,  see  [2-3]  Section  3.  For  any  value  K  >  \ 
we  have  ,  ,  ,      •  .     ,        . 

(4.13)-  •.     ■  P(|a;,j|  > /r)  <exp(-/\V2).  '" 


In  turn  this  implies  that  max] <,;<„, i<j<p  \xij\  <p  \/\og(n  Vp).  Moreover,  the  vectors  x.j  are 
such  that 

(4.14)    P{\  ||x.,||-E[||x.,||]  |>A')<2exp(-2A'V7r2)   a.nd  E{\\x.,\\]  ^  V^.,j  =  1, . . .  ^p. 


Combining  these  bounds  we  obtain  minj=i_.  ,^p  \jYL^=i  ^u  ~p  V^"  \/logp.  Therefore,  con- 
ditions (4.3)  hold  with  Cn^p  —p  y  "s^^-^p^  and  i^log^(n  V  p)  =  o{n).  On  the  other  hand,  we 
have $\phi(1) \geq 1$ and $\phi(m_0 + s) \lesssim_p 1 + \sqrt{(m_0/n)\log p} + \sqrt{(s/n)\log p} \lesssim 1$ by Lemma 14 and
the definition of $m_0$. Thus, Theorem 4 requires $t^2 s\log(n \vee p) = o(n)$. We also verify that
$q \sim \mu \sim 1$ in the next section.

Example 6 (Correlated Normal Design, continued). We analyze the correlated design
similarly, using comparison theorems for Gaussian random variables, Corollary 3.12 of
Ledoux and Talagrand [23]. The upper bound for the case $\rho > 0$ follows from the result
that for $K \geq 1$

(4.15)  $P\left(\max_{1 \leq i \leq n,\, 1 \leq j \leq p} |x_{ij}| \geq K\right) \leq P\left(\max_{1 \leq i \leq n,\, 1 \leq j \leq p} |z_{ij}| \geq K\right)$

where $z_{ij} \sim N(0, 1)$ are i.i.d. as in Example 5. (The case with $\rho < 0$ follows by changing
the signs of $x_{\cdot j}$ for each even $j$ and redefining the parameter $\beta(u)$ for these new regressors,
so that after the transformation we obtain the design with $\rho > 0$.) The lower bound relies
only on the independence within the components of each vector $x_{\cdot j}$. Since $x_{i'j}$ and $x_{ij}$
are independent for $i' \neq i$, we can invoke the same results as in Example 5. Therefore we
obtain $C_{n,p} \sim_p \sqrt{\log(n \vee p)/n}$ and $t^2\log^2(n \vee p) = o(n)$. In addition, $\phi(1) \geq 1$ and $\phi(m_0 + s) \lesssim_p
\{(1+|\rho|)/(1-|\rho|)\}\left(1 + \sqrt{(m_0/n)\log p} + \sqrt{(s/n)\log p}\right) \lesssim \{(1+|\rho|)/(1-|\rho|)\}$ by Lemma 14
and the definition of $m_0$. Since $\rho$ is fixed it follows that $\phi(1) \sim_p \phi(m_0 + s)$. Thus, Theorem
4 also requires $t^2 s\log(n \vee p) = o(n)$ in this case. We also verify that $q \sim \mu \sim 1$ in the next
section.
5. Empirical Performance. In order to assess the finite-sample practical performance
of the proposed estimators, we conducted a Monte Carlo study and an application
to international economic growth.

5.1. Monte Carlo Simulation. In this section we compare the performance of the
canonical quantile regression estimator, the $\ell_1$-penalized quantile regression, the two-step
estimator, and the ideal oracle estimator. Recall that the two-step estimator applies the
canonical quantile regression to the model selected by the penalized estimator. The oracle
estimator applies the canonical quantile regression to the minimal true model. (Of course,
such an estimator is not available outside Monte Carlo experiments.) We focus our attention
on the model selection properties of the penalized estimator and on the biases and standard
deviations of these estimators.

We begin by considering the following regression model (see Example 1):

$y = x'\beta(1/2) + \varepsilon, \quad \beta(1/2) = (1, 1, 1, 1, 1, 0, \ldots, 0)',$

where $x$ consists of an intercept and the covariates $x_{-1} \sim N(0, I)$, and the errors $\varepsilon$ are independent
and identically distributed, $\varepsilon \sim N(0, 1)$. We set the dimension $p$ of covariates $x$ equal to 1000,
the dimension $s$ of the true minimal model to 5, and the sample size $n$ to 200. We
set the regularization parameter $\lambda$ equal to the 0.9-quantile of the pivotal random variable $\Lambda_n$,
following our proposal in Section 2.

We also consider a variant of the model above with correlated regressors, namely $x_{-1} \sim
N(0, \Sigma)$, where $\Sigma_{ij} = \rho^{|i-j|}$, as specified in Example 2 with $\rho = 0.5$. This design is noteworthy
because it violates the condition of the uniform uncertainty principle, but it easily
satisfies our conditions.
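The two designs above can be sketched in code. The following is a minimal illustration using only the Python standard library, not the authors' simulation code. The function name `simulate_design`, the seed handling, and the convention that the intercept is the first of the $p$ columns are assumptions made for the example. The correlated columns are generated by an AR(1) recursion, which is one standard way to obtain $\mathrm{Cov}(x_j, x_k) = \rho^{|j-k|}$ with unit variances.

```python
import math
import random

def simulate_design(n=200, p=1000, s=5, rho=0.0, seed=0):
    """Draw one sample (X, y, beta) from the Monte Carlo design.

    Assumptions (not stated in this form in the text): the first column
    of X is the intercept, and the remaining p - 1 columns are Gaussian
    with Cov(x_j, x_k) = rho ** |j - k|. Requires p >= 2 and s <= p.
    """
    rng = random.Random(seed)
    beta = [1.0] * s + [0.0] * (p - s)
    scale = math.sqrt(1.0 - rho * rho)
    X, y = [], []
    for _ in range(n):
        prev = rng.gauss(0.0, 1.0)
        row = [1.0, prev]
        for _ in range(p - 2):
            # AR(1) step: keeps unit variance, correlation decays as rho ** lag
            prev = rho * prev + scale * rng.gauss(0.0, 1.0)
            row.append(prev)
        eps = rng.gauss(0.0, 1.0)  # N(0, 1) error, so the median of y given x is x'beta
        y.append(sum(row[j] * beta[j] for j in range(s)) + eps)
        X.append(row)
    return X, y, beta

def sample_corr(X, j, k):
    """Sample correlation between columns j and k of X."""
    n = len(X)
    mj = sum(r[j] for r in X) / n
    mk = sum(r[k] for r in X) / n
    cov = sum((r[j] - mj) * (r[k] - mk) for r in X)
    vj = sum((r[j] - mj) ** 2 for r in X)
    vk = sum((r[k] - mk) ** 2 for r in X)
    return cov / math.sqrt(vj * vk)
```

Calling `simulate_design(rho=0.0)` gives the isotropic design and `simulate_design(rho=0.5)` the correlated one; with `rho=0.0` the recursion reduces to independent draws.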

We summarize the results on the model selection performance of the penalized estimator in
Figures 2-3. In the left panels of Figures 2-3, we plot the frequencies of the dimensions of the
selected model; in the right panels we plot the frequencies of selecting the correct components.
From the right panels we see that the model selection performance is particularly good.
From the left panels we see that the frequency of selecting a much larger model than the
minimal true model is very small. We also see that in the design with correlated regressors,
the performance of the estimator is quite good, as we would expect from our theoretical
results. These results confirm the theoretical results of Theorem 2, namely, that when the
non-zero coefficients are well separated from zero, with probability tending to one, the
penalized estimator should select a model that includes the minimal true model as a
subset. Moreover, these results also confirm the theoretical result of Theorem 1, namely that
the dimension of the selected model should be of the same stochastic order as the dimension
of the true minimal model. In summary, we find that the model selection performance of
the penalized estimator agrees very well with our theoretical results.

We summarize results on the estimation performance in Table 1. We see that the penalized
quantile regression estimator significantly outperforms the canonical quantile regression, as
we would expect from Theorem 1 and from the inconsistency of the latter when the number of
regressors is larger than the sample size. The penalized quantile regression has a substantial
bias, as we would expect from the definition of the estimator, which penalizes large deviations
of coefficients from zero. Furthermore, we see that the two-step estimator improves upon
the penalized quantile regression, particularly by drastically reducing the bias. The
two-step estimator in fact does almost as well as the ideal oracle estimator, as we would
expect from Theorem 4. We also see that the (unarbitrary) correlation of regressors does not
harm the performance of the penalized and the two-step estimators, which we would expect
from our theoretical results. In fact, since the data-driven value of $\lambda$ tends to be slightly lower for
the correlated case, as we would expect by the comparison theorem mentioned in Example
8, the penalized estimator selects smaller models and also makes smaller estimation errors
than in the canonical uncorrelated case. In summary, we find the estimation performance
of the penalized and two-step estimators to be in agreement with our theoretical results.
Example 1: Isotropic Gaussian Design

                 Mean $\ell_0$ norm   Mean $\ell_1$ norm   Bias      Std Deviation
Canonical QR     992.29               25.27                1.6929    0.99
Penalized QR     5.14                 2.43                 1.1519    0.37
Two-step QR      5.14                 4.97                 0.0276    0.29
Oracle QR        5.00                 5.00                 0.0012    0.20

Example 2: Correlated Gaussian Design

                 Mean $\ell_0$ norm   Mean $\ell_1$ norm   Bias      Std Deviation
Canonical QR     988.41               29.40                1.2526    1.11
Penalized QR     5.19                 4.09                 0.4316    0.29
Two-step QR      5.19                 5.02                 0.0075    0.27
Oracle QR        5.00                 5.00                 0.0013    0.25

Table 1
The table displays the average $\ell_0$ and $\ell_1$ norm of the estimators as well as mean bias and standard
deviation. We obtained the results using 5000 Monte Carlo repetitions for each design.
[Figure 2. Left panel: histogram of the number of non-zero components in $\hat\beta(1/2)$. Right panel: histogram of the number of correct components selected ($s = 5$, $\rho = 0$).]

Fig 2. The figure summarizes the covariate selection results for the isotropic normal design example, based
on 5000 Monte Carlo repetitions. The left panel plots the histogram for the number of covariates selected out
of the possible 1000 covariates. The right panel plots the histogram for the number of significant covariates
selected; there are in total 5 significant covariates amongst 1000 covariates.

[Figure 3. Left panel: histogram of the number of non-zero components in $\hat\beta(1/2)$. Right panel: histogram of the number of correct components selected ($s = 5$, $\rho = 0.5$).]

Fig 3. The figure summarizes the covariate selection results for the correlated normal design example with
correlation coefficient $\rho = 0.5$, based on 5000 Monte Carlo repetitions. The left panel plots the histogram for
the number of covariates selected out of the possible 1000 covariates. The right panel plots the histogram
for the number of significant covariates selected; there are in total 5 significant covariates amongst 1000
covariates.
5.2. International Economic Growth Example. In this section we apply $\ell_1$-penalized
quantile regression to an international economic growth example, using it primarily as a
method for model selection. We use the Barro and Lee data consisting of a panel of 138
countries for the period of 1960 to 1985. We consider the national growth rates in gross
domestic product (GDP) per capita as a dependent variable $y$ for the periods 1965-75 and
1975-85.¹ In our analysis, we consider a model with nearly $p = 60$ covariates, which
allows for a total of $n = 90$ complete observations. Our goal here is to select a subset of
these covariates and briefly compare the resulting models to the standard models used in
the empirical growth literature (Barro and Sala-i-Martin [2], Koenker and Machado [21]).

One of the central issues in the empirical growth literature is the estimation of the effect
of an initial (lagged) level of GDP per capita on the growth rates of GDP per capita. In
particular, a key prediction from the classical Solow-Swan-Ramsey growth model is the
hypothesis of convergence, which states that poorer countries should typically grow faster
and therefore should tend to catch up with the richer countries. Thus, such a hypothesis
states that the effect of the initial level of GDP on the growth rate should be negative. As
pointed out in Barro and Sala-i-Martin [3], this hypothesis is rejected using a simple bivariate
regression of growth rates on the initial level of GDP. (In our case, median regression yields
a positive coefficient of 0.00045.) In order to reconcile the data and the theory, the literature
has focused on estimating the effect conditional on the pertinent characteristics of countries.
Covariates that describe such characteristics can include variables measuring education and
science policies, strength of market institutions, trade openness, savings rates and others
[3]. The theory then predicts that for countries with otherwise similar characteristics the effect
of the initial level of GDP on the growth rate should be negative ([3]).

Given that the number of covariates we can condition on is comparable to the sample
size, covariate selection becomes an important issue in this analysis ([24], [31]). In
particular, previous findings came under severe criticism for relying upon ad hoc
procedures for covariate selection. In fact, in some cases, all of the previous findings have
been questioned ([24]). Since the number of covariates is high, there is no simple way to
resolve the model selection problem using only the classical tools. Indeed, the number of
possible lower-dimensional models is very large, though see [24] and [31] for an attempt
to search over several million of these models. Here we use the lasso selection device,
specifically the $\ell_1$-penalized median regression, to resolve this important issue.


¹ The growth rate in GDP over the period from $t_1$ to $t_2$ is commonly defined as $\log(GDP_{t_2}/GDP_{t_1}) - 1$.
Let us now turn to our empirical results. We performed covariate selection using the
$\ell_1$-penalized median regressions, where we initially used our data-driven choice of the penalization
parameter $\lambda$. This initial choice led us to select no covariates, which is consistent with
situations in which the true coefficients are not well separated from zero. We then proceeded
to slowly decrease the penalization parameter in order to allow for some covariates to be
selected. We present the model selection results in Table 2. With the first relaxation of
the choice of $\lambda$, we select the black market exchange rate premium (characterizing trade
openness) and a measure of political instability. With a second relaxation of the choice of $\lambda$
we select an additional set of educational attainment variables, and several others reported
in the table. With a third relaxation of $\lambda$ we include yet another set of variables, also reported
in the table. We refer the reader to [2] and [3] for a complete definition and discussion of
each of these variables.

We then proceeded to apply the standard median and quantile regressions to the selected
models, and we also report the standard confidence intervals for these estimates. In Figures 4
and 5 we show these results graphically, plotting estimates of quantile regression coefficients
$\hat\beta(u)$ and pointwise confidence intervals on the vertical axis against the quantile index $u$ on
the horizontal axis. We should note that the confidence intervals do not take into account
that we have selected the models using the data. (In an ongoing companion work, we are
devising procedures that will account for this.) We find that, in all models that
we have selected, the median regression coefficient on the initial level of GDP is always
negative and the standard confidence intervals do not include zero. Similar conclusions also
hold for quantile regressions with quantile indices in the middle range. In summary, we
believe that our empirical findings support the hypothesis of convergence from the classical
Solow-Swan-Ramsey growth model. Of course, it would be good to find formal inferential
methods to fully support this hypothesis. Finally, our findings also agree with, and thus support,
the previous findings reported in Barro and Sala-i-Martin [2] and Koenker and Machado
[21].

6. Conclusion and Extensions. In this work we characterize the estimation and
model selection properties of the $\ell_1$-penalized quantile regression for high-dimensional sparse
models. Despite the non-linear nature of the problem, we provide results on estimation
and model selection that parallel those recently obtained for the penalized least squares
estimator. It is likely that our proof techniques can be useful for deriving results for other
M-estimation problems.
MODEL SELECTION RESULTS FOR THE INTERNATIONAL GROWTH REGRESSIONS

Real GDP per capita (log) is included in all models. $\lambda = 1.077968$.

Penalization
Parameter      Additional Selected Variables

$\lambda$      -

$\lambda/2$    Black Market Premium (log)
               Political Instability

$\lambda/3$    Black Market Premium (log)
               Political Instability
               Measure of tariff restriction
               Infant mortality rate
               Ratio of real government "consumption" net of defense and education
               Exchange rate
               % of "higher school complete" in female population
               % of "secondary school complete" in male population

$\lambda/4$    Black Market Premium (log)
               Political Instability
               Measure of tariff restriction
               Infant mortality rate
               Ratio of real government "consumption" net of defense and education
               Exchange rate
               % of "higher school complete" in female population
               % of "secondary school complete" in male population
               Female gross enrollment ratio for higher education
               % of "no education" in the male population
               Population proportion over 65
               Average years of secondary schooling in the male population

$\lambda/5$    Black Market Premium (log)
               Political Instability
               Measure of tariff restriction
               Infant mortality rate
               Ratio of real government "consumption" net of defense and education
               Exchange rate
               % of "higher school complete" in female population
               % of "secondary school complete" in male population
               Female gross enrollment ratio for higher education
               % of "no education" in the male population
               Population proportion over 65
               Average years of secondary schooling in the male population
               Growth rate of population
               % of "higher school attained" in male population
               Ratio of nominal government expenditure on defense to nominal GDP
               Ratio of import to GDP

Table 2
For this particular decreasing sequence of penalization parameters we obtained nested models. All the
columns of the design matrix were normalized to have unit length.
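The two preprocessing choices mentioned in the note to Table 2, unit-length columns and a decreasing sequence $\lambda, \lambda/2, \ldots, \lambda/5$, can be sketched as follows. This is an illustrative sketch only: the helper names `normalize_columns` and `relaxation_path` are hypothetical and not part of any code released with the paper, and the sketch assumes no column is identically zero.

```python
import math

def normalize_columns(X):
    """Rescale each column of X (a list of rows) to unit Euclidean length.

    Assumes every column has at least one non-zero entry.
    """
    p = len(X[0])
    norms = [math.sqrt(sum(row[j] ** 2 for row in X)) for j in range(p)]
    return [[row[j] / norms[j] for j in range(p)] for row in X]

def relaxation_path(lam, steps=5):
    """Return the decreasing sequence lam, lam/2, ..., lam/steps."""
    return [lam / k for k in range(1, steps + 1)]

# The starting value is the data-driven penalty level reported in Table 2.
path = relaxation_path(1.077968)
```

Each value in `path` would then be passed to the penalized median regression solver of one's choice; the table reports that the selected models happened to be nested along this particular path.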
Fig 4. This figure plots the coefficient estimates and standard pointwise 90% confidence intervals for the
model associated with $\lambda/2$, which selected two covariates in addition to the initial level of GDP.

Fig 5. This figure plots the coefficient estimates and standard pointwise 90% confidence intervals for the
model associated with $\lambda/3$, which selected eight covariates in addition to the initial level of GDP.
There are several possible extensions that we would like to pursue in future work.
First, we would like to extend our results to hold uniformly across a continuum of quantile
indices. We expect that most of our results will generalize to this case with a few appropriate
modifications. Second, following van der Geer [37], we would like to allow for a
regressor-specific choice of the penalization parameter. Specifically, we would like to consider the
following estimator:

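(6.1)  $\hat\beta(u) \in \arg\min_{\beta \in \mathbb{R}^p} \; E_n\left[\rho_u(y_i - x_i'\beta)\right] + \frac{\lambda}{n}\sum_{j=1}^p \hat\sigma_j |\beta_j|$

where $\hat\sigma_j^2 = \frac{1}{n}\sum_{i=1}^n x_{ij}^2$. The dual problem associated with (6.1) has the form:

(6.2)  $\max_{a \in \mathbb{R}^n} \; E_n[y_i a_i]$
       subject to  $|E_n[x_{ij} a_i]| \leq \frac{\lambda}{n}\hat\sigma_j, \quad j = 1, \ldots, p,$
                   $(u - 1) \leq a_i \leq u, \quad i = 1, \ldots, n.$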

To map this to our previous framework, we can redefine the regressors and the parameter
spaces via the transformations $\tilde x_{ij} = x_{ij}/\hat\sigma_j$ and $\tilde\beta_j(u) = \hat\sigma_j\beta_j(u)$. We can then proceed with an
analogous proof strategy. Third, we would like to extend our analysis to cover non-sparse
models that are well approximated by sparse models. In such a framework, the components
of $\beta(u)$ reordered by magnitude, namely $|\beta_{(1)}(u)| \geq |\beta_{(2)}(u)| \geq \cdots \geq |\beta_{(p-1)}(u)| \geq |\beta_{(p)}(u)|$,
exhibit a sufficiently rapid decay behavior, for example, $|\beta_{(k)}(u)| \leq R k^{-t}$ for some constants
$R$ and $t$. Therefore, truncation to zero of all components below a particular moving
threshold can still lead to consistent estimation.
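The rescaling argument for the second extension can be checked numerically. The sketch below is an illustration under stated assumptions, not part of the paper: the helper names `column_scales`, `rescale` and `fitted` are hypothetical, and the data are arbitrary random draws. It verifies that dividing column $j$ by $\hat\sigma_j$ and multiplying $\beta_j$ by $\hat\sigma_j$ leaves the fitted values $x_i'\beta$ unchanged and turns the weighted penalty $\sum_j \hat\sigma_j|\beta_j|$ into a plain $\ell_1$ penalty.

```python
import math
import random

def column_scales(X):
    """sigma_j = sqrt((1/n) * sum_i x_ij^2) for each column j."""
    n, p = len(X), len(X[0])
    return [math.sqrt(sum(row[j] ** 2 for row in X) / n) for j in range(p)]

def rescale(X, beta, sigma):
    """Return (X_tilde, beta_tilde) with x_ij / sigma_j and sigma_j * beta_j."""
    p = len(sigma)
    X_tilde = [[row[j] / sigma[j] for j in range(p)] for row in X]
    beta_tilde = [sigma[j] * beta[j] for j in range(p)]
    return X_tilde, beta_tilde

def fitted(X, beta):
    """Linear index x_i' beta for every observation."""
    return [sum(x * b for x, b in zip(row, beta)) for row in X]

# Arbitrary illustrative data: columns with different scales.
rng = random.Random(0)
X = [[rng.gauss(0.0, 1.0 + j) for j in range(4)] for _ in range(30)]
beta = [1.0, -2.0, 0.0, 0.5]

sigma = column_scales(X)
X_tilde, beta_tilde = rescale(X, beta, sigma)

weighted_penalty = sum(s * abs(b) for s, b in zip(sigma, beta))
plain_penalty = sum(abs(b) for b in beta_tilde)
```

Because the fitted values and the penalty both agree after the transformation, the objective in (6.1) evaluated at $\beta$ equals the unweighted penalized objective evaluated at $\tilde\beta$ on the rescaled design.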

APPENDIX  A:  VERIFICATION  OF  CONDITIONS  E.1-E.5 

In this section we verify that conditions E.1-E.5 hold under the simple set of conditions
D.1-D.4 discussed in Section 2 and also hold much more generally. For convenience, we
denote by

$S^k = \{\alpha \in \mathbb{R}^p : \|\alpha\| = 1, \; \|\alpha\|_0 \leq k\}$

the $k$-sparse unit sphere in $\mathbb{R}^p$. In what follows, we show how the key constants, such as $q$
and $\mu$ appearing in E.1-E.5, are functions of the following population constants (which can
possibly depend on the sample size $n$):

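(A.1)  $\underline{f} := \inf_{x} f_{y_i|x_i}(x'\beta(u) \mid x), \qquad \bar{f} := \sup_{y,x} f_{y_i|x_i}(y \mid x),$

       $\underline{q}(k) := \inf_{\alpha \in S^k} E\left[(\alpha'x)^2\right], \qquad \bar{f}' := \sup_{y,x} \left|\frac{\partial}{\partial y} f_{y_i|x_i}(y \mid x)\right|,$

where values of $y$ and $x$ range over the support of $y_i$ and $x_i$. The results also depend on the
sparse eigenvalue of the sample design matrix

$\phi(k) := \sup_{\alpha \in S^k} E_n\left[(\alpha'x_i)^2\right]$

already mentioned earlier. As an illustration we compute the constants in (A.1) for two
common designs used in the literature.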
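For a small design matrix the empirical sparse eigenvalue $\phi(k)$ can be computed by brute force, which may help fix ideas. The sketch below is an illustration, not the authors' procedure: the function names are hypothetical, the search enumerates all supports of size $k$ (feasible only for small $p$), and the largest eigenvalue of each Gram submatrix is approximated by plain power iteration. It uses the fact that the supremum over $k$-sparse unit vectors equals the largest eigenvalue over all $k \times k$ principal submatrices of $E_n[x_i x_i']$.

```python
import itertools
import math

def gram(X):
    """Empirical second-moment matrix E_n[x_i x_i'] for a list of rows X."""
    n, p = len(X), len(X[0])
    return [[sum(r[j] * r[k] for r in X) / n for k in range(p)] for j in range(p)]

def largest_eigenvalue(M, iters=200):
    """Approximate the top eigenvalue of a symmetric PSD matrix by power iteration."""
    d = len(M)
    v = [1.0 + 0.1 * i for i in range(d)]  # fixed, non-degenerate starting vector
    lam = 0.0
    for _ in range(iters):
        w = [sum(M[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            return 0.0  # start vector mapped to zero; treated as a zero block in this sketch
        v = [x / norm for x in w]
        # Rayleigh quotient v'Mv of the normalized iterate
        lam = sum(v[i] * sum(M[i][j] * v[j] for j in range(d)) for i in range(d))
    return lam

def sparse_eigenvalue(X, k):
    """phi(k): maximum of alpha' E_n[x x'] alpha over unit vectors with at most k non-zeros."""
    G = gram(X)
    p = len(G)
    best = 0.0
    for S in itertools.combinations(range(p), min(k, p)):
        sub = [[G[a][b] for b in S] for a in S]
        best = max(best, largest_eigenvalue(sub))
    return best
```

The enumeration grows combinatorially in $p$ and $k$, so this is only meant for toy examples; it is not how one would handle $p = 1000$.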

Example 7 (Isotropic Normal Design, continued). We revisit the design of Example
1. For concreteness assume that the errors are $\varepsilon \sim N(0, 1)$. Under this simple design we
can compute the values of the several constants involved in the analysis: $\bar{f} = 1/\sqrt{2\pi} < 0.4$,
$\underline{f} = 1/\sqrt{2\pi} > 0.39$, $\bar{f}' = 1/\sqrt{2\pi e} < 0.25$, $\gamma(k) = \sqrt{\pi/8} > 0.6$, $\underline{q}(k) = 1$, and $\varphi(k) = 1$.

Example 8 (Correlated Normal Design, continued). Consider next the design in Example
2. For concreteness assume that $\varepsilon \sim N(0, 1)$ and that $\rho = 1/2$. The relevant constants
are bounded by $\bar{f} = 1/\sqrt{2\pi} < 0.4$, $\underline{f} = 1/\sqrt{2\pi} > 0.39$, $\bar{f}' = 1/\sqrt{2\pi e} < 0.25$,

7(fc)  >  /i^v^  >  1/3,  Q{k)>V^  =  1/6,  and  ^{k)  <  j^  =  3. 

A.1. Verification of E.1-E.5. Condition E.1 (model sparseness) is the key underlying
model assumption, which we impose throughout, including in condition D.2. Lemmas
1, 2, 3, 4, and 5 below establish the remaining conditions E.2-E.5.

Lemma 1 (Verifying Condition E.2 - Identification). We have that in the linear quantile
model (2.1) under random sampling, for each $\beta \in \mathbb{R}^p$ with $\|\beta - \beta(u)\| = r$ and $\|\beta - \beta(u)\|_0 \leq m$,

(A.2)  $Q_u(\beta) - Q_u(\beta(u)) \geq q(m)\left(r^2 \wedge g(r)\right),$

where 

(A.3)  g{r)  =  r      and     q(m)  ^  — ^minO,    \-j^^[m)\    \. 
Thus, condition E.2 holds with $q = q(n)$. In particular, under Conditions D.1-D.4, condition
E.2 holds with $q = q(n) \sim 1$.

Proof. (Lemma 1) Let $F_{y|x}$ denote the conditional distribution of $y$ given $x$. From
Knight [17], for any two scalars $w$ and $v$ we have that

(A.4)  $\rho_u(w - v) - \rho_u(w) = -v\left(u - 1\{w \leq 0\}\right) + \int_0^v \left(1\{w \leq z\} - 1\{w \leq 0\}\right) dz.$
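Knight's identity (A.4) is easy to check numerically, which can be a useful sanity test when implementing the check function $\rho_u(t) = t\,(u - 1\{t \leq 0\})$. The sketch below is only an illustration: the function names and the midpoint-rule grid size are choices made for this example, and the integral is approximated rather than evaluated in closed form.

```python
def check_function(t, u):
    """Quantile check function rho_u(t) = t * (u - 1{t <= 0})."""
    return t * (u - (1.0 if t <= 0 else 0.0))

def knight_lhs(w, v, u):
    """Left-hand side of (A.4): rho_u(w - v) - rho_u(w)."""
    return check_function(w - v, u) - check_function(w, u)

def knight_rhs(w, v, u, grid=20000):
    """Right-hand side of (A.4), with the integral from 0 to v done by a midpoint rule."""
    ind_w = 1.0 if w <= 0 else 0.0
    step = v / grid  # signed step, so v < 0 is handled automatically
    integral = 0.0
    for k in range(grid):
        z = (k + 0.5) * step
        integral += ((1.0 if w <= z else 0.0) - ind_w) * step
    return -v * (u - ind_w) + integral
```

For instance, with $w = 0.5$, $v = 2$ and $u = 0.3$ both sides equal $0.9$ up to the discretization error of the midpoint rule, which is at most $|v|$ divided by the grid size.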

Applying (A.4) with $w = y - x'\beta(u)$ and $v = x'(\beta - \beta(u))$ we have that $E\left[-v\left(u - 1\{w \leq 0\}\right)\right] = 0$.
Using the law of iterated expectations and a mean value expansion, we obtain for $\tilde z_{x,z} \in [0, z]$


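(A.5)
$Q_u(\beta) - Q_u(\beta(u)) = E\left[\int_0^{x'(\beta - \beta(u))} \left(F_{y|x}(x'\beta(u) + z) - F_{y|x}(x'\beta(u))\right) dz\right]$

$\geq E\left[\tfrac{1}{2}\left(x'(\beta - \beta(u))\right)^2 f_{y|x}(x'\beta(u))\right] - \tfrac{1}{6}\bar{f}'\, E\left[|x'(\beta - \beta(u))|^3\right]$

$\geq \tfrac{1}{2}\underline{f}\, E\left[\left(x'(\beta - \beta(u))\right)^2\right] - \tfrac{1}{6}\bar{f}'\, E\left[|x'(\beta - \beta(u))|^3\right].$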


Next  define 

r„  =  sup  |f   :   Quif3{u)  +  fd)  -  Q„(/3(u))  >  ^f^~E  [[x'df]  ,  for  all  d  G  S^ 
By  (A.5)  we  have  that     ^.  "  •         "  '        '  .    ,  ' 


^  37    .    .    E[\a'x\^]       3  /     ,     , 


P{u)\\  <  TTi,,  we 


By  construction  of  r,„  and  the  convexity  of  Qu,  for  any  /3  such  that  || 
have  that 

Qu{P)-QuW{u))  >  =E[{x'{f3  -  (3{u)))^]aI\\P  -  P{u)\\  fMQuWiu)  +  r,nd)-QuW{u)) 


Letting  \\(3  —  f3{u)\\  =  r  we  have 


r  {   M^  Quif3{u)  +  Tmd)  -  QuiPiu))  )  > 


rfQ{m)rl 


and  =E\ix'{P-(3{u))f\  >      -^^    ' 


4  '  '      4 

where  the  first  inequality  holds  by  construction  of  r^;  hence 

QuiP)  -  Qu[p{u))  >  —^ A^ — >q(m)(r^Ar) 


for $q(m)$ defined in (A.3). □
The following two lemmas verify the Empirical Pre-sparseness condition.

Lemma 2 (Verifying Condition E.3 - Empirical Pre-Sparseness). We have that the number
of non-zero components of $\hat\beta(u)$ is bounded by $n \wedge p$, namely

$\|\hat\beta(u)\|_0 \leq n \wedge p.$

Suppose that $y_1, \ldots, y_n$ are absolutely continuous conditional on $x_1, \ldots, x_n$; then the number
of interpolated points, $h = |\{i : y_i = x_i'\hat\beta(u)\}|$, is equal to $\|\hat\beta(u)\|_0$ with probability one.

Proof. Trivially we have $\|\hat\beta(u)\|_0 \leq p$. Let $Y = (y_1, \ldots, y_n)'$, let $X$ be the $n \times p$ matrix
with rows $x_i'$, $i = 1, \ldots, n$, let $c = (u e', (1-u)e', \lambda e', \lambda e')'$, and let $A = [I \;\; -I \;\; X \;\; -X]$, where
$e = (1, 1, \ldots, 1)'$ denotes vectors of ones of conformable dimensions, and $I$ denotes the $n \times n$
identity matrix. Note that the penalized quantile regression can be written as

$\min_{\xi^+, \xi^-, \beta^+, \beta^- \geq 0} \; u e'\xi^+ + (1-u)e'\xi^- + \lambda e'\beta^+ + \lambda e'\beta^- \quad \text{s.t.} \quad \xi^+ - \xi^- + X\beta^+ - X\beta^- = Y,$

which is equivalent to

$\min_{w \geq 0} \; c'w \quad \text{s.t.} \quad Aw = Y.$

Matrix $A$ has rank $n$, since it has linearly independent rows. By Theorem 2.4 of Bertsimas
and Tsitsiklis [6] there is at least one optimal basic solution $w^*$ with at most $n$ non-zero
components. We defined $\hat\beta(u)$ as a basic solution with the minimal number of non-zero
components (note that $\|\hat\beta(u)\|_0 = \|\hat\beta^+(u)\|_0 + \|\hat\beta^-(u)\|_0$ since $\lambda > 0$). Let $h$ denote the
number of interpolated points. We have that $n - h$ components of $\xi^+$ and $\xi^-$ are non-zero.
Therefore, we have $\|\hat\beta(u)\|_0 + (n - h) \leq n$, which leads to $\|\hat\beta(u)\|_0 \leq h \leq n$.

To prove the second statement, consider the dual problem $\max_a \{Y'a : A'a \leq c\}$. Conditional
on $X$, consider the polyhedron defined by $\{a : A'a \leq c\}$, which has a finite number
of vertices. Since $c \geq 0$ componentwise this polyhedron is non-empty (i.e., zero is always
feasible for the dual problem). Moreover, the form of $A'$ implies that $\{a : A'a \leq c\}$ is a
bounded set. Therefore, if the solution of the dual is not unique there exist vertices $a^1, a^2$
such that $Y'(a^1 - a^2) = 0$. This is a zero probability event since $Y$ is absolutely continuous
conditional on $X$ and the number of vertices is finite. Therefore the dual problem has a
unique solution with probability one. If the dual basic solution is unique, we have that the
primal basic solution is non-degenerate, that is, the number of non-zero variables equals
$n$, see [6]. Therefore, we have with probability one that $\|\hat\beta(u)\|_0 + (n - h) = n$, or that
$\|\hat\beta(u)\|_0 = h$. □
From the complementary slackness condition of linear programming, see Theorem 4.5 of
[6], we have that for any component $j \in \{1, \ldots, p\}$

(A.6)  $\hat\beta_j(u) > 0$ only if $E_n[x_{ij}\hat a_i(u)] = \frac{\lambda}{n}$, and
       $\hat\beta_j(u) < 0$ only if $E_n[x_{ij}\hat a_i(u)] = -\frac{\lambda}{n}$,

where $\hat a(u)$ solves the dual problem (2.6).

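Lemma 3 (Verifying Condition E.3 - Empirical Pre-Sparseness, continued). Let $m =
\|\hat\beta(u)\|_0$. For any $\lambda > 0$ we have

$m \leq \frac{n^2\phi(m)}{\lambda^2}.$

Proof. Let $\hat a(u)$ be the solution of the dual problem (2.6), $\hat T = \mathrm{support}(\hat\beta(u))$, and
$m = \|\hat\beta(u)\|_0 = |\hat T|$. For any $k \in \hat T$, from (A.6) we have $(X'\hat a(u))_k = \mathrm{sign}(\hat\beta_k(u))\lambda$ and, for
$k \notin \hat T$, we have $\mathrm{sign}(\hat\beta_k(u)) = 0$. Therefore, by the Cauchy-Schwarz inequality we have

$m\lambda = \mathrm{sign}(\hat\beta(u))'\mathrm{sign}(\hat\beta(u))\lambda = \mathrm{sign}(\hat\beta(u))'(X'\hat a(u)) = (X\,\mathrm{sign}(\hat\beta(u)))'\hat a(u)$

$\leq \|X\,\mathrm{sign}(\hat\beta(u))\|\,\|\hat a(u)\| \leq \sqrt{n\phi(m)}\,\|\mathrm{sign}(\hat\beta(u))\|\,\|\hat a(u)\|,$

where we used that $\|\mathrm{sign}(\hat\beta(u))\|_0 = m$. Since $\|\hat a(u)\| \leq \sqrt{\max\{u, 1-u\}\,n} \leq \sqrt{n}$, and
$\|\mathrm{sign}(\hat\beta(u))\| = \sqrt{m}$, we have $m\lambda \leq n\sqrt{m\phi(m)}$, which yields the result. □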

We shall need some additional notation in what follows. Let $\psi_i(\beta, u)$
denote the score function for the $i$th observation. Define the set of $m$-sparse vectors near
the true value $\beta(u)$,

$R(r, m) := \{\beta \in \mathbb{R}^p : \|\beta\|_0 \leq m, \; \|\beta - \beta(u)\| \leq r\},$

and define the sparse sphere associated with a given vector $\beta$ as

$S(\beta) = \{\alpha \in \mathbb{R}^p : \|\alpha\| \leq 1, \; \mathrm{support}(\alpha) \subseteq \mathrm{support}(\beta)\}.$

Also, define the following empirical and linearization errors

(A.7)  $\epsilon_0(m, n, p) := \sup_{\alpha \in S^m} \left|G_n(\alpha'\psi_i(\beta(u), u))\right|,$
       $\epsilon_1(r, m, n, p) := \sup_{\beta \in R(r,m),\, \alpha \in S(\beta)} \left|G_n(\alpha'\psi_i(\beta, u)) - G_n(\alpha'\psi_i(\beta(u), u))\right|,$
       $\epsilon_2(r, m, n, p) := \sup_{\beta \in R(r,m),\, \alpha \in S(\beta)} \sqrt{n}\,\left|E[\alpha'\psi_i(\beta, u)] - E[\alpha'\psi_i(\beta(u), u)]\right|,$

where $G_n$ is the empirical process operator, that is, $G_n(f) := n^{-1/2}\sum_{i=1}^n \left(f(X_i) - E[f(X_i)]\right)$.
Next we verify condition E.4.

Lemma 4 (Verifying Condition E.4 - Empirical Sparseness). Let $m = \|\hat\beta(u)\|_0$, $r =
\|\hat\beta(u) - \beta(u)\|$, and suppose that $y_1, \ldots, y_n$ are absolutely continuous conditional on $x_1, \ldots, x_n$.
We have that, in the linear quantile model (2.1) under random sampling,

$\sqrt{m} \lesssim_p \mu(m)\,\frac{n}{\lambda}(r \wedge 1) + \sqrt{m}\,\frac{\sqrt{n\log(n \vee p)\phi(m)} \vee \sqrt{n\log(n \vee p)\varphi(m)}}{\lambda}$

uniformly in $m \leq n$ and $r$, where $\mu(m) = \sqrt{\varphi(m)}\left(\sqrt{\varphi(m)}\,\bar{f} \vee 1\right)$. Therefore, provided
$\varphi(m) \lesssim_p \phi(m)$, condition E.4 holds with $\mu = \mu(n)$, namely

$\sqrt{m} \lesssim_p \mu\,\frac{n}{\lambda}(r \wedge 1) + \sqrt{m}\,\frac{\sqrt{n\log(n \vee p)\phi(m)}}{\lambda}.$

In  particular,  under  D.I-D.4,  <p{m)  <  1  and  (I>{1)  >   1,  so  that  condition  E.4  holds  with 

Proof. (Lemma 4) It will be convenient to define three vectors of rank scores (dual
variables):

1. the true rank scores, $a_i^*(u) = u - 1\{y_i \leq x_i'\beta(u)\}$ for $i = 1, \ldots, n$;

2. the estimated rank scores, $a_i(u) = u - 1\{y_i \leq x_i'\hat\beta(u)\}$ for $i = 1, \ldots, n$;

3. the dual optimal rank scores, $\hat a(u)$, that solve the dual program (2.6).

Let $\hat T$ denote the support of $\hat\beta(u)$. Let $x_{i\hat T} = (x_{ij}, j \in \hat T)'$, and $\hat\beta_{\hat T}(u) = (\hat\beta_j(u), j \in \hat T)'$.
From the complementary slackness characterizations (A.6) we have that

(A.8)  $\mathrm{sign}(\hat\beta_{\hat T}(u)) = \frac{n E_n\left[x_{i\hat T}\,\hat a_i(u)\right]}{\lambda}$, i.e. $\sqrt{m} = \|\mathrm{sign}(\hat\beta_{\hat T}(u))\| = \frac{\left\|n E_n\left[x_{i\hat T}\,\hat a_i(u)\right]\right\|}{\lambda}.$

Therefore we can bound the number of non-zero components of $\hat\beta(u)$ provided we can
bound the empirical expectation in (A.8). This is achieved in the next step by combining
the maximal inequalities and assumptions on the design matrix.

Using the triangle inequality in (A.8), write

$\lambda\sqrt{m} \leq \left\|n E_n\left[x_{i\hat T}(\hat a_i(u) - a_i(u))\right]\right\| + \left\|n E_n\left[x_{i\hat T}(a_i(u) - a_i^*(u))\right]\right\| + \left\|n E_n\left[x_{i\hat T}\,a_i^*(u)\right]\right\|.$
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Then  we  bound  each  of  the  three  components  in  this  display.  To  bound  the  last  component, 
we  use  Lemma  9  to  get 

||nE„  [a;.^a*(u)]||  <  ^/7^€o (m,n,p)  <p  ^JnmAog(nyp)  (^fp{m)V  \/4'{m) 

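To bound the first component, we observe that $\hat a_i(u) \ne a_i(u)$ only if $y_i = x_i'\hat\beta(u)$. By Lemma 2 the penalized quantile regression fit can interpolate at most $m$ points with probability one. This implies that $\mathbb{E}_n[|\hat a_i(u) - a_i(u)|^2] \le m/n$. Therefore, we get

$$\|n\mathbb{E}_n[x_{i\hat T}(\hat a_i(u) - a_i(u))]\| \le n\sup_{\alpha\in\mathbb{S}^m_p}\mathbb{E}_n[|\alpha'x_i|\,|\hat a_i(u) - a_i(u)|]$$

$$\le n\sup_{\alpha\in\mathbb{S}^m_p}\sqrt{\mathbb{E}_n[|\alpha'x_i|^2]}\,\sqrt{\mathbb{E}_n[|\hat a_i(u) - a_i(u)|^2]}$$

$$\le \sqrt{n\,\phi(m)\,m}.$$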

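To bound the second component, note that

$$\|n\mathbb{E}_n[x_{i\hat T}(a_i(u) - a_i^*(u))]\| \le \|\sqrt{n}\,\mathbb{G}_n[x_{i\hat T}(a_i(u) - a_i^*(u))]\| + \|n\mathrm{E}[x_{i\hat T}(a_i(u) - a_i^*(u))]\|$$

$$\le \sqrt{n}\,\epsilon_1(r,m,n,p) + \sqrt{n}\,\epsilon_2(r,m,n,p)$$

$$\lesssim_P \sqrt{nm\log(n\vee p)}\,\sqrt{\varphi(m)\vee\phi(m)} + n\sqrt{\varphi(m)}\,\big(\sqrt{\varphi(m)}\,\bar f\,r \wedge 1\big),$$

where we use Lemma 8 and Lemma 10 to bound respectively $\epsilon_1(r,m,n,p)$ and $\epsilon_2(r,m,n,p)$.

Setting $\mu(m) = \sqrt{\varphi(m)}\,(\sqrt{\varphi(m)}\,\bar f \vee 1) \ge \sqrt{\varphi(m)}$ and using that $m \le n$, the first result follows.

Under D.1-D.4 we have $\varphi(n) \lesssim 1$ and $\phi(1) \ge 1$, and condition E.4 holds. □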

Lemma 5 (Verifying Condition E.5 - Empirical Error). We have that, in the linear quantile model (2.1) under random sampling, and uniformly over $m \le n$, $r > 0$, and the region $R(r,m)$,

$$\big|\hat Q_u(\beta) - Q_u(\beta) - (\hat Q_u(\beta(u)) - Q_u(\beta(u)))\big| \lesssim_P r\sqrt{\frac{(m+s)\log(n\vee p)}{n}}\,\big(\sqrt{\varphi(m+s)} \vee \sqrt{\phi(m+s)}\big).$$

In particular, under D.1-D.4 we have that condition E.5 holds, namely

$$\big|\hat Q_u(\beta) - Q_u(\beta) - (\hat Q_u(\beta(u)) - Q_u(\beta(u)))\big| \lesssim_P \frac{r}{\sqrt n}\sqrt{(m+s)\log(n\vee p)\,\phi(m+s)}$$

uniformly over $m \le n$, $r > 0$, and the region $R(r,m)$.

Proof. For convenience let $\epsilon_n := |\hat Q_u(\beta) - Q_u(\beta) - (\hat Q_u(\beta(u)) - Q_u(\beta(u)))|$. Since $r \ge \|\beta - \beta(u)\|$ and $\|\beta - \beta(u)\|_0 \le m + s$, we have that

$$\epsilon_n \le \frac{r}{\sqrt n}\int_0^1 \epsilon_1(rz, m+s, n, p)\,dz \le \frac{r}{\sqrt n}\,\epsilon_1(r, m+s, n, p).$$
The first result follows from Lemma 8.
Under D.1-D.4 we have $\varphi(n) \lesssim 1$ and $\phi(1) \ge 1$, and condition E.5 holds. □

A.2. Controlling Empirical and Linearization Errors. Here we exploit the technical results of Appendix A.4 to control the empirical errors $\epsilon_0$ and $\epsilon_1$. These technical results provide the maximal inequalities for a collection of empirical processes indexed by submodels' dimensions $m \le n$, which may be of some independent interest. These technical results and their usage rely on the concepts of the VC dimension and the uniform covering number for a class of functions (see, e.g., [38]).

We begin with a bound on the VC dimension of relevant function classes.

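Lemma 6. Consider a fixed subset $T \subset \{1,2,\dots,p\}$, $|T| = m$. The classes of functions

$$\mathcal{F}_T = \{\alpha'(\psi_i(\beta,u) - \psi_i(\beta(u),u)) : \alpha \in \mathbb{S}(\beta),\ \mathrm{support}(\beta) \subseteq T\}, \quad\text{and}$$

$$\mathcal{G}_T = \{\alpha'\psi_i(\beta(u),u) : \mathrm{support}(\alpha) \subseteq T\}$$

have their VC index bounded by $cm$ for some universal constant $c$.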

Proof. We prove the result for $\mathcal{F}_T$, and we omit the proof for $\mathcal{G}_T$ as it is similar.

Consider the classes of functions $\mathcal{W} := \{x'\alpha : \mathrm{support}(\alpha) \subseteq T\}$ and $\mathcal{V} := \{1\{y \le x'\beta\} : \mathrm{support}(\beta) \subseteq T\}$ (for convenience let $Z = (y,x)$). Since $T$ is fixed and has cardinality $m$, their VC index is bounded by $m + 2$; see, for example, van der Vaart and Wellner [38] Lemma
2.6.15. Next consider $f \in \mathcal{F}_T$, which can be written in the form $f(Z) := g(Z)(1\{h(Z) \le 0\} - 1\{p(Z) \le 0\})$ where $g \in \mathcal{W}$, and $1\{h \le 0\}$ and $1\{p \le 0\} \in \mathcal{V}$. The VC index of $\mathcal{F}_T$ is by definition equal to the VC index of the class of sets $\{(Z,t) : f(Z) \le t\}$, $f \in \mathcal{F}_T$, $t \in \mathbb{R}$. We have that

$$\{(Z,t) : f(Z) \le t\} = \{(Z,t) : g(Z)(1\{h(Z) \le 0\} - 1\{p(Z) \le 0\}) \le t\}$$

$$= \{(Z,t) : h(Z) > 0,\ p(Z) > 0,\ t \ge 0\}\ \cup$$

$$\cup\ \{(Z,t) : h(Z) \le 0,\ p(Z) \le 0,\ t \ge 0\}\ \cup$$

$$\cup\ \{(Z,t) : h(Z) \le 0,\ p(Z) > 0,\ g(Z) \le t\}\ \cup$$

$$\cup\ \{(Z,t) : h(Z) > 0,\ p(Z) \le 0,\ -g(Z) \le t\}.$$

Thus each set $\{(Z,t) : f(Z) \le t\}$ is created by taking finite unions, intersections, and complements of the basic sets $\{Z : h(Z) > 0\}$, $\{Z : p(Z) \le 0\}$, $\{t \ge 0\}$, $\{(Z,t) : -g(Z) \le t\}$, and $\{(Z,t) : g(Z) \le t\}$; note that $-g \in \mathcal{W}$ whenever $g \in \mathcal{W}$. These basic sets form VC classes, each having VC index of order $m$. Therefore, the VC index of the class of sets $\{(Z,t) : f(Z) \le t\}$, $f \in \mathcal{F}_T$, $t \in \mathbb{R}$, is of the same order as the sum of the VC indices of the set classes formed by the basic VC sets; see van der Vaart and Wellner [38] Lemma 2.6.17. □
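The case analysis in this proof can be checked mechanically. The sketch below is illustrative only: it treats $g(Z)$, $h(Z)$, $p(Z)$ and $t$ as plain numbers on a small grid and assumes the fourth case reads $-g(Z) \le t$ (since $f = -g$ when $h > 0$ and $p \le 0$). The helper names `f_value` and `in_union` are hypothetical.

```python
from itertools import product

def f_value(g: float, h: float, p: float) -> float:
    """f = g * (1{h <= 0} - 1{p <= 0})."""
    return g * ((1 if h <= 0 else 0) - (1 if p <= 0 else 0))

def in_union(g: float, h: float, p: float, t: float) -> bool:
    """Membership in the union of the four case sets."""
    return ((h > 0 and p > 0 and t >= 0) or
            (h <= 0 and p <= 0 and t >= 0) or
            (h <= 0 and p > 0 and g <= t) or
            (h > 0 and p <= 0 and -g <= t))

# Compare the two descriptions of {f <= t} on every grid point.
grid = [-1.5, -0.5, 0.0, 0.5, 1.5]
mismatches = [(g, h, p, t) for g, h, p, t in product(grid, repeat=4)
              if (f_value(g, h, p) <= t) != in_union(g, h, p, t)]
```

An empty `mismatches` list means the set $\{f \le t\}$ and the union of the four cases agree at every grid point.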

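Next we control the uniform $L_2$ covering numbers for function classes generated by taking the union of all $m$-dimensional subsets of a $p$-dimensional set.

Lemma 7. For any $m \le n$, consider the classes of functions

$$\mathcal{F}_{m,n,p} = \{\alpha'(\psi_i(\beta,u) - \psi_i(\beta(u),u)) : \beta \in \mathbb{R}^p,\ \|\beta\|_0 \le m,\ \alpha \in \mathbb{S}(\beta)\} \quad\text{and}$$

$$\mathcal{G}_{m,n,p} = \{\alpha'\psi_i(\beta(u),u) : \alpha \in \mathbb{S}^m_p\},$$

with envelope functions $F_{m,n,p}$ and $G_{m,n,p}$. For each $\varepsilon > 0$

$$\sup_Q N(\varepsilon\|F_{m,n,p}\|_{Q,2}, \mathcal{F}_{m,n,p}, L_2(Q)) \le C\Big(\frac{16e}{\varepsilon}\Big)^{2(cm-1)}\Big(\frac{ep}{m}\Big)^m,$$

$$\sup_Q N(\varepsilon\|G_{m,n,p}\|_{Q,2}, \mathcal{G}_{m,n,p}, L_2(Q)) \le C\Big(\frac{16e}{\varepsilon}\Big)^{2(cm-1)}\Big(\frac{ep}{m}\Big)^m,$$

for some universal constants $C$ and $c$.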

Proof. Let $\mathcal{F}_T$ denote a restriction of $\mathcal{F}_{m,n,p}$ for a particular choice of $m$ non-zero components. It follows that its VC dimension is at most $cm$ by Lemma 6. In turn this implies that the covering number of $\mathcal{F}_T$ is bounded by

$$N(\varepsilon\|F_T\|_{Q,2}, \mathcal{F}_T, L_2(Q)) \le C(cm)(16e)^{cm}\Big(\frac{1}{\varepsilon}\Big)^{2(cm-1)},$$

where $C$ is a universal constant; see van der Vaart and Wellner [38] Theorem 2.6.7. Since we have at most $\binom{p}{m} \le (ep/m)^m$ different restrictions $T$, the total covering number is bounded according to the statement of the lemma. The proof for $\mathcal{G}_{m,n,p}$ follows similarly. □
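The counting step above uses the standard bound $\binom{p}{m} \le (ep/m)^m$ on the number of supports of size $m$. A minimal numerical check over a small range is sketched here; the helper name `binom_bound_holds` and the range of $p$ are arbitrary choices for illustration.

```python
import math

def binom_bound_holds(p: int, m: int) -> bool:
    """Check C(p, m) <= (e * p / m) ** m for 1 <= m <= p."""
    return math.comb(p, m) <= (math.e * p / m) ** m

# Every pair (p, m) with 1 <= m <= p <= 40.
checked = [(p, m) for p in range(1, 41) for m in range(1, p + 1)]
all_hold = all(binom_bound_holds(p, m) for p, m in checked)
```

The bound follows from $\binom{p}{m} \le p^m/m!$ together with $m! \ge (m/e)^m$, so the check is expected to pass for every pair.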

Next we proceed to control the empirical errors $\epsilon_0$ and $\epsilon_1$ as defined in (A.7).
Lemma 8 (Controlling error $\epsilon_1$). We have that in the linear quantile model (2.1)
$$\epsilon_1(r,m,n,p) \lesssim_P \sqrt{m\log(n\vee p)}\,\max\{\sqrt{\varphi(m)}, \sqrt{\phi(m)}\}$$
uniformly in $r$ and $m \le n$.
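Proof. By definition of $\epsilon_1$ we have $\epsilon_1(r,m,n,p) \le \sup_{f\in\mathcal{F}_{m,n,p}}|\mathbb{G}_n(f)|$. From Lemma 7 the uniform covering number of $\mathcal{F}_{m,n,p}$ is bounded by $C(16e/\varepsilon)^{2(cm-1)}(ep/m)^m$. Using Lemma 18 we have that uniformly in $m \le n$

$$\sup_{f\in\mathcal{F}_{m,n,p}}|\mathbb{G}_n(f)| \lesssim_P \sqrt{m\log(n\vee p)}\,\max\Big\{\sup_{f\in\mathcal{F}_{m,n,p}}\mathrm{E}[f^2]^{1/2},\ \sup_{f\in\mathcal{F}_{m,n,p}}\mathbb{E}_n[f^2]^{1/2}\Big\}. \tag{A.9}$$

Since $|\alpha'(\psi_i(\beta,u) - \psi_i(\beta(u),u))| = |\alpha'x_i|\,|1\{y_i \le x_i'\beta\} - 1\{y_i \le x_i'\beta(u)\}| \le |\alpha'x_i|$,

$$\mathbb{E}_n[f^2] \le \mathbb{E}_n[|\alpha'x_i|^2] \le \phi(m) \quad\text{and}\quad \mathrm{E}[f^2] \le \mathrm{E}[|\alpha'x_i|^2] \le \varphi(m), \tag{A.10}$$

using the definition of $\phi(m)$ and $\varphi(m)$. Combining (A.10) with (A.9) we obtain the result. □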

Next we bound $\epsilon_0$ using the same tools we used to bound $\epsilon_1$.

Lemma 9 (Controlling empirical error $\epsilon_0$). In the linear quantile model (2.1) we have
$$\epsilon_0(m,n,p) \lesssim_P \sqrt{m\log(n\vee p)}\,\max\{\sqrt{\varphi(m)}, \sqrt{\phi(m)}\}$$
uniformly in $m \le n$.

Proof. The proof is similar to the proof of Lemma 8 with $\mathcal{G}_{m,n,p}$ instead of $\mathcal{F}_{m,n,p}$. Note that for $g \in \mathcal{G}_{m,n,p}$ we have $\mathbb{E}_n[g^2] = \mathbb{E}_n[(\alpha'\psi_i(\beta(u),u))^2] = \mathbb{E}_n[(\alpha'x_i)^2(1\{y_i \le x_i'\beta(u)\} - u)^2] \le \mathbb{E}_n[(\alpha'x_i)^2] \le \phi(m)$ for all $\alpha \in \mathbb{S}^m_p$. □

Alternatively we could bound $\epsilon_0$ using Theorem 5.2.2 of Stout [34] to achieve a dependence on $\phi(1)$ instead of $\phi(m)$ by making additional assumptions on the covariates $x_{ij}$. Now we proceed to bound $\epsilon_2$.

Lemma 10 (Controlling linearization error $\epsilon_2$). We have that in the linear quantile model

$$\epsilon_2(r,m,n,p) \le \sqrt{n}\,\sqrt{\varphi(m)}\,\big(\sqrt{\varphi(m)}\,\bar f\,r \wedge 1\big)$$

uniformly in $r > 0$ and $m \le n$.

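Proof. By definition we have

$$\epsilon_2(r,m,n,p) = \sup_{\beta\in R(r,m),\ \alpha\in\mathbb{S}(\beta)} \sqrt{n}\,\big|\mathrm{E}[\alpha'\psi_i(\beta,u)] - \mathrm{E}[\alpha'\psi_i(\beta(u),u)]\big|$$

$$= \sup_{\beta\in R(r,m),\ \alpha\in\mathbb{S}(\beta)} \sqrt{n}\,\big|\mathrm{E}[(\alpha'x_i)(1\{y_i \le x_i'\beta\} - 1\{y_i \le x_i'\beta(u)\})]\big|.$$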
By the Cauchy-Schwarz inequality the expression above is bounded by

$$\sqrt{n}\,\sup_{\alpha\in\mathbb{S}^m_p}\sqrt{\mathrm{E}[(\alpha'x_i)^2]}\ \sup_{\beta\in R(r,m)}\sqrt{\mathrm{E}\big[(1\{y_i \le x_i'\beta\} - 1\{y_i \le x_i'\beta(u)\})^2\big]}.$$


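By definition $\varphi(m) = \sup_{\alpha\in\mathbb{S}^m_p}\mathrm{E}[(\alpha'x_i)^2]$. Next, since $|1\{y_i \le x_i'\beta\} - 1\{y_i \le x_i'\beta(u)\}| \le 1\{|y_i - x_i'\beta(u)| \le |x_i'(\beta - \beta(u))|\}$, we have

$$\mathrm{E}\big[(1\{y_i \le x_i'\beta\} - 1\{y_i \le x_i'\beta(u)\})^2\big] = \mathrm{E}\big[|1\{y_i \le x_i'\beta\} - 1\{y_i \le x_i'\beta(u)\}|\big]$$

$$\le \mathrm{E}\big[1\{|y_i - x_i'\beta(u)| \le |x_i'(\beta - \beta(u))|\}\big]$$

$$\le \mathrm{E}\Big[\int_{-|x_i'(\beta-\beta(u))|}^{|x_i'(\beta-\beta(u))|} f_{y_i|x_i}(t + x_i'\beta(u)\,|\,x_i)\,dt \wedge 1\Big]$$

$$\le \Big(2\bar f\,\|\beta - \beta(u)\|\sup_{\alpha\in\mathbb{S}^m_p}\mathrm{E}[|\alpha'x_i|]\Big) \wedge 1$$

$$\le \big(2r\bar f\sqrt{\varphi(m)}\big) \wedge 1.$$

□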
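The indicator inequality used at the start of this chain, $|1\{y \le x'\beta\} - 1\{y \le x'\beta(u)\}| \le 1\{|y - x'\beta(u)| \le |x'(\beta - \beta(u))|\}$, depends only on $y$ and the two fitted values. The sketch below checks it on a grid; the function names and the grid are illustrative assumptions, and the fitted values $x'\beta$ and $x'\beta(u)$ are passed in directly as numbers.

```python
from itertools import product

def lhs(y: float, fit_b: float, fit_u: float) -> int:
    """|1{y <= x'b} - 1{y <= x'b(u)}| with fitted values passed directly."""
    return abs((1 if y <= fit_b else 0) - (1 if y <= fit_u else 0))

def rhs(y: float, fit_b: float, fit_u: float) -> int:
    """1{|y - x'b(u)| <= |x'(b - b(u))|}."""
    return 1 if abs(y - fit_u) <= abs(fit_b - fit_u) else 0

# Multiples of 1/4 in [-2, 2] are exactly representable as floats.
grid = [i / 4 for i in range(-8, 9)]
violations = [(y, fb, fu) for y, fb, fu in product(grid, repeat=3)
              if lhs(y, fb, fu) > rhs(y, fb, fu)]
```

The two indicators can differ only when $y$ lies between the two fitted values, in which case $y$ is within $|x'(\beta - \beta(u))|$ of $x'\beta(u)$; this is why no violations are expected.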

A.3. Lemmas on Sparse Eigenvalues. In this section we collect lemmas on the maximum $k$-sparse eigenvalues that are used in some of the derivations presented earlier. Recall the notation for the unit sphere $\mathbb{S}^{n-1} = \{\alpha \in \mathbb{R}^n : \|\alpha\| = 1\}$ and the $k$-sparse unit sphere $\mathbb{S}^k_p = \{\alpha \in \mathbb{R}^p : \|\alpha\| = 1, \|\alpha\|_0 \le k\}$. For a matrix $M$, let $\phi_M(k)$ denote the maximum $k$-sparse eigenvalue of $M$, namely $\phi_M(k) = \sup\{\alpha'M\alpha : \alpha \in \mathbb{S}^k_p\}$.

We begin with a lemma that establishes a type of subadditivity of the maximum sparse eigenvalues as a function of the cardinality.

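Lemma 11. Let $M$ be a positive semi-definite matrix. For any integers $k$ and $\ell k$ with $\ell \ge 1$ we have

$$\phi_M(\ell k) \le \lceil\ell\rceil\,\phi_M(k).$$

Proof. Let $\bar\alpha$ achieve $\phi_M(\ell k)$. Moreover let $\sum_{i=1}^{\lceil\ell\rceil}\alpha_i = \bar\alpha$ such that $\sum_{i=1}^{\lceil\ell\rceil}\|\alpha_i\|_0 = \|\bar\alpha\|_0$. We can choose the $\alpha_i$'s such that $\|\alpha_i\|_0 \le k$ since $\lceil\ell\rceil k \ge \ell k$.

Since $M$ is positive semi-definite, for any $i,j$ we have $\alpha_i'M\alpha_i + \alpha_j'M\alpha_j \ge 2|\alpha_i'M\alpha_j|$.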
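Therefore, we have

$$\phi_M(\ell k) = \bar\alpha'M\bar\alpha = \sum_{i=1}^{\lceil\ell\rceil}\alpha_i'M\alpha_i + \sum_{i=1}^{\lceil\ell\rceil}\sum_{j\ne i}\alpha_i'M\alpha_j \le \lceil\ell\rceil\sum_{i=1}^{\lceil\ell\rceil}\alpha_i'M\alpha_i \le \lceil\ell\rceil\sum_{i=1}^{\lceil\ell\rceil}\|\alpha_i\|^2\,\phi_M(\|\alpha_i\|_0).$$

Note that $\sum_{i=1}^{\lceil\ell\rceil}\|\alpha_i\|^2 = 1$ and thus $\phi_M(\ell k) \le \lceil\ell\rceil\max_{i=1,\dots,\lceil\ell\rceil}\phi_M(\|\alpha_i\|_0) \le \lceil\ell\rceil\,\phi_M(k)$. □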
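To make the subadditivity concrete, the sketch below checks the special case $k = 1$, $\ell = 2$, that is $\phi_M(2) \le 2\phi_M(1)$, on a small Gram matrix. It is an illustration under stated assumptions, not part of the paper: the toy design `X` is arbitrary, and the helpers `gram`, `phi_1`, and `phi_2` are hypothetical names. For $k \le 2$ the sparse eigenvalues have closed forms, which avoids any numerical eigen-solver.

```python
import math
from itertools import combinations
from typing import List

def gram(X: List[List[float]]) -> List[List[float]]:
    """Return M = X'X / n, which is positive semi-definite by construction."""
    n, p = len(X), len(X[0])
    return [[sum(X[i][a] * X[i][b] for i in range(n)) / n for b in range(p)]
            for a in range(p)]

def phi_1(M: List[List[float]]) -> float:
    """Maximum 1-sparse eigenvalue: the largest diagonal entry."""
    return max(M[j][j] for j in range(len(M)))

def phi_2(M: List[List[float]]) -> float:
    """Maximum 2-sparse eigenvalue: largest eigenvalue over 2x2 principal submatrices."""
    best = 0.0
    for i, j in combinations(range(len(M)), 2):
        a, d, b = M[i][i], M[j][j], M[i][j]
        top = (a + d) / 2 + math.sqrt(((a - d) / 2) ** 2 + b * b)
        best = max(best, top)
    return best

X = [[1.0, 2.0, 0.0, 1.0],
     [0.0, 1.0, 1.0, -1.0],
     [1.0, 0.0, 3.0, 2.0]]  # toy design, illustrative only
M = gram(X)
p1, p2 = phi_1(M), phi_2(M)
```

Here `p1` equals $10/3$ (the largest diagonal entry), and `p2` lies between `p1` and `2 * p1`, as the lemma predicts.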

The following lemmas characterize the behavior of the maximal sparse eigenvalue for the case of correlated Gaussian regressors. We start by establishing an upper bound on $\phi(k)$ that holds with high probability.

Lemma 12. Consider $x_i = \Sigma^{1/2}z_i$, where $z_i \sim N(0, I_p)$, $p \ge n$, and $\sup_{\alpha\in\mathbb{S}^k_p}\alpha'\Sigma\alpha \le \sigma^2(k)$. Let $\phi(k)$ be the maximal $k$-sparse eigenvalue of $\mathbb{E}_n[x_ix_i']$, for $k \le n$. Then with probability converging to one, uniformly in $k \le n$,

$$\sqrt{\phi(k)} \lesssim \sigma(k)\Big(1 + \sqrt{k/n}\,\sqrt{\log p}\Big).$$


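Proof. By Lemma 11 it suffices to establish the result for $k \le n/2$. Let $Z$ be the $n \times p$ matrix collecting vectors $z_i'$, $i = 1,\dots,n$, as rows. Consider the Gaussian process $\mathcal{G}_k : (\alpha,\bar\alpha) \mapsto \bar\alpha'Z\alpha/\sqrt{n}$, where $(\alpha,\bar\alpha) \in \mathbb{S}^k_p \times \mathbb{S}^{n-1}$. Note that

$$\|\mathcal{G}_k\| = \sup_{(\alpha,\bar\alpha)}|\bar\alpha'Z\alpha/\sqrt{n}| = \sup_{\alpha\in\mathbb{S}^k_p}\sqrt{\alpha'\mathbb{E}_n[z_iz_i']\alpha} = \sqrt{\phi(k)}.$$
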
Using Borell's concentration inequality for the Gaussian process (see van der Vaart and Wellner [38] Lemma A.2.1) we have that $P\{\|\mathcal{G}_k\| - \mathrm{median}\|\mathcal{G}_k\| > r\} \le e^{-nr^2/2}$. Also, by classical results on the behavior of the maximal eigenvalues of the Gaussian covariance matrices (see Geman [13]), as $n \to \infty$, for any $k/n \to \gamma \in [0,1]$, we have that $\lim_{k,n}(\mathrm{median}\|\mathcal{G}_k\| - 1 - \sqrt{k/n}) = 0$. Since $k/n$ lies within $[0,1]$, any sequence $k_n/n$ has a convergent subsequence with limit in $[0,1]$. Therefore, we can conclude that, as $n \to \infty$, $\limsup_{k_n,n}(\mathrm{median}\|\mathcal{G}_{k_n}\| - 1 - \sqrt{k_n/n}) \le 0$. This further implies $\limsup_n\sup_{k\le n}(\mathrm{median}\|\mathcal{G}_k\| - 1 - \sqrt{k/n}) \le 0$. Therefore, for any $r_0 > 0$ there exists $n_0$ large enough such that for all $n \ge n_0$ and all $k \le n$, $P\{\|\mathcal{G}_k\| > 1 + \sqrt{k/n} + r + r_0\} \le e^{-nr^2/2}$. There are at most $\binom{p}{k}$ subvectors of $z_i$ containing $k$ elements, so we conclude that for $n \ge n_0$,

p\  sup  ^a'¥.n\ziz'^a  >  1  +  ^/k/^  +  r^  +  ro  I  <  (^6-""^/^ 

Summing  over  k  <  n  we  obtain 

TpI  sup  Ja'Enlz^z'^a  >  1  +  Jk/^  +  r,+ro\  <  f^  (fle""''^'- 

fc=l  "€S^  fc=l  V^V 


Setting  r/c  =   %Jck/n\ogp  for  c  >  1  and  using  that  (^)   <  p^ ,  we  bound  the  right  side 
by  2^_j  e'^~'^^'^'°SP  — >  0  as  n  -^  cx).  We  conclude  that  with  probability  converging  to 


one,  uniformly  for  all  k:  sup^ggit  J a'¥.n\ziz[]a   <    1  +  \/k/n^\ogp.  Furthermore,  since 


sup^ggjc  a'Ea  <  (J~(k),  we  conclude  that  with  probability  converging  to  one,  uniformly 


for  all  k:  sup^gg^  J a'Er,[x^x'^\a  <  a{k){l  +  y'k/n^/logp).  D 

Next, relying on Sudakov's minoration, we show a lower bound on the expectation of the maximum $k$-sparse eigenvalue. We do not use the lower bound in the analysis, but the result shows that the upper bound is sharp in terms of the rate dependence on $k$, $p$, and $n$.

Lemma 13. Consider $x_i = \Sigma^{1/2}z_i$, where $z_i \sim N(0, I_p)$, and $\inf_{\alpha\in\mathbb{S}^k_p}\alpha'\Sigma\alpha \ge \sigma^2(k)$. Let $\phi(k)$ be the maximal $k$-sparse eigenvalue of $\mathbb{E}_n[x_ix_i']$, for $k \le n \le p$. Then for any even $k$ we have that:


(1)    E 


,/m]>^^/{k/m^^EiP^)andi2)    /^  >p  ^^(fc/2)  log(p  - /c). 


Proof. Let $X$ be the $n \times p$ matrix collecting vectors $x_i'$, $i = 1,\dots,n$, as rows. Consider the Gaussian process $(\alpha,\bar\alpha) \mapsto \bar\alpha'X\alpha/\sqrt{n}$, where $(\alpha,\bar\alpha) \in \mathbb{S}^k_p \times \mathbb{S}^{n-1}$. Note that $\sqrt{\phi(k)}$ is the supremum of this Gaussian process

$$\sup_{(\alpha,\bar\alpha)}|\bar\alpha'X\alpha/\sqrt{n}| = \sup_{\alpha\in\mathbb{S}^k_p}\sqrt{\alpha'\mathbb{E}_n[x_ix_i']\alpha} = \sqrt{\phi(k)}. \tag{A.11}$$

Hence we proceed in four steps: In Step 1, we consider the uncorrelated case and prove the lower bound (1) on the expectation of the supremum using Sudakov's minoration, using a lower bound on a relevant packing number. In Step 2, we derive the lower bound on the packing number. In Step 3, we generalize Step 1 to the correlated case. In Step 4, we prove the lower bound (2) on the supremum itself using Borell's concentration inequality.
Step 1. In this step we consider the case where $\Sigma = I$ and show the result using Sudakov's minoration. By fixing $\bar\alpha = (1,\dots,1)'/\sqrt{n} \in \mathbb{S}^{n-1}$, we have $\sqrt{\phi(k)} \ge \sup_{\alpha\in\mathbb{S}^k_p}\mathbb{E}_n[x_i'\alpha] = \sup_{\alpha\in\mathbb{S}^k_p}Z_\alpha$, where $\alpha \mapsto Z_\alpha := \mathbb{E}_n[x_i'\alpha]$ is a Gaussian process on $\mathbb{S}^k_p$. We will bound $\mathrm{E}[\sup_{\alpha\in\mathbb{S}^k_p}Z_\alpha]$ from below using Sudakov's minoration.

We consider the standard deviation metric on $\mathbb{S}^k_p$ induced by $Z$: for any $t,s \in \mathbb{S}^k_p$,

$$d(s,t) = \sigma(Z_t - Z_s) = \sqrt{\mathrm{E}[(Z_t - Z_s)^2]} = \sqrt{\mathrm{E}\big[(\mathbb{E}_n[x_i'(t-s)])^2\big]} = \|t - s\|/\sqrt{n}.$$

Consider the packing number $D(\varepsilon, \mathbb{S}^k_p, d)$, the largest number of disjoint closed balls of radius $\varepsilon$ with respect to the metric $d$ that can be packed into $\mathbb{S}^k_p$, see [10]. We will bound the packing number from below for $\varepsilon = 1/\sqrt{n}$. In order to do this we restrict attention to the collection $T$ of elements $t = (t_1,\dots,t_p) \in \mathbb{S}^k_p$ such that $t_j = 1/\sqrt{k}$ for exactly $k$ components and $t_j = 0$ in the remaining $p - k$ components. There are $|T| = \binom{p}{k}$ of such elements. Consider any $s,t \in T$ such that the support of $s$ agrees with the support of $t$ in at most $k/2$ elements. In this case

$$\|s - t\|^2 = \sum_{j=1}^p|t_j - s_j|^2 \ge \sum_{j\in\mathrm{support}(t)\setminus\mathrm{support}(s)}\frac{1}{k} + \sum_{j\in\mathrm{support}(s)\setminus\mathrm{support}(t)}\frac{1}{k} \ge 2\cdot\frac{k}{2}\cdot\frac{1}{k} = 1. \tag{A.12}$$

Let $V$ be the set of the maximal cardinality, consisting of elements in $T$ such that $|\mathrm{support}(t)\setminus\mathrm{support}(s)| \ge k/2$ for every $s,t \in V$. By the inequality (A.12) we have that $D(1/\sqrt{n}, \mathbb{S}^k_p, d) \ge |V|$. Furthermore, by Step 2 given below we have that $|V| \ge (p-k)^{k/2}$.
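Inequality (A.12) can be verified by brute force for small $p$ and $k$. The sketch below enumerates all supports of size $k$ and checks that any two elements of $T$ whose supports share at most $k/2$ indices are at squared Euclidean distance at least 1. The function name `check_separation` and the chosen $(p, k)$ pairs are illustrative assumptions.

```python
from itertools import combinations

def check_separation(p: int, k: int) -> bool:
    """For all s, t in T whose supports share at most k/2 indices, verify ||s - t||^2 >= 1."""
    supports = [frozenset(c) for c in combinations(range(p), k)]
    for S, U in combinations(supports, 2):
        if len(S & U) <= k // 2:
            # Each index in the symmetric difference contributes (1/sqrt(k))^2 = 1/k.
            sq_dist = len(S ^ U) / k
            if sq_dist < 1.0:
                return False
    return True

ok = all(check_separation(p, k) for p, k in [(6, 2), (7, 4), (8, 4)])
```

The check relies on the fact that two elements of $T$ differ exactly on the symmetric difference of their supports, where each coordinate differs by $1/\sqrt{k}$.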

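Using Sudakov's minoration ([12], Theorem 4.1.4), we conclude that
$$\mathrm{E}\Big[\sup_{\alpha\in\mathbb{S}^k_p}Z_\alpha\Big] \ge \sup_{\varepsilon>0}\varepsilon\sqrt{\log D(\varepsilon, \mathbb{S}^k_p, d)} \ge \frac{1}{\sqrt n}\sqrt{\log D(1/\sqrt{n}, \mathbb{S}^k_p, d)} \ge \sqrt{k\log(p-k)/(2n)},$$

proving the claim of the lemma for the case $\Sigma = I$.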

Step 2. In this step we show that $|V| \ge (p-k)^{k/2}$.

It is convenient to identify every element $t \in T$ with the set $\mathrm{support}(t)$, where $\mathrm{support}(t) = \{j \in \{1,\dots,p\} : t_j = 1/\sqrt{k}\}$, which has cardinality $k$. For any $t \in T$ let $\mathcal{N}(t) = \{s \in T : |\mathrm{support}(t)\setminus\mathrm{support}(s)| < k/2\}$. By construction we have that $\max_{t\in T}|\mathcal{N}(t)|\,|V| \ge |T|$. Since, as shown below, $|\mathcal{N}(t)| \le K := \binom{k}{k/2}\binom{p-k/2}{k/2}$ for every $t$, we conclude that
$|V| \ge |T|/K = \binom{p}{k}/K \ge (p-k)^{k/2}$.

It remains only to show that $|\mathcal{N}(t)| \le \binom{k}{k/2}\binom{p-k/2}{k/2}$. Consider an arbitrary $t \in T$. Fix any $k/2$ components of $\mathrm{support}(t)$, and generate elements $s \in \mathcal{N}(t)$ by switching any of the remaining $k/2$ components in $\mathrm{support}(t)$ to any of the possible $p - k/2$ values. This
gives us at most $\binom{p-k/2}{k/2}$ such elements $s \in \mathcal{N}(t)$. Next let us repeat this procedure for all other combinations of initial $k/2$ components of $\mathrm{support}(t)$, where the number of such combinations is bounded by $\binom{k}{k/2}$. In this way we generate every element $s \in \mathcal{N}(t)$. From the construction we conclude that $|\mathcal{N}(t)| \le \binom{k}{k/2}\binom{p-k/2}{k/2}$.

Step 3. The case where $\Sigma \ne I$ follows similarly, noting that the new metric, $d(s,t) = \sigma(Z_t - Z_s) = \sqrt{\mathrm{E}[(Z_t - Z_s)^2]}$, satisfies

$$d(s,t) \ge \sigma(2k)\,\|s - t\|/\sqrt{n} \quad\text{since}\quad \|s - t\|_0 \le 2k.$$

Step 4. Using Borell's concentration inequality (see van der Vaart and Wellner [38] Lemma A.2.1) for the supremum of the Gaussian process defined in (A.11), we have $P\{|\sqrt{\phi(k)} - \mathrm{E}[\sqrt{\phi(k)}]| > r\} \le 2e^{-nr^2/2}$, which proves the second claim of the lemma. □

Next  we  combine  the  previous  lemmas  to  control  the  empirical  sparse  eigenvalues  of 
Examples  1  and  2. 


Lemma  14.     For  k  <  n,  under  the  design  of  Example  1  we  have  ^(fc)  ~p  1  +  J    °^^. 
For  k  <n,  under  the  design  of  Example  2  we  have 

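Proof. Consider Example 1. Let $x_{i,-1}$ denote the $i$th observation without the first component. Write

$$\mathbb{E}_n[x_ix_i'] = \begin{bmatrix} 1 & \mathbb{E}_n[x_{i,-1}'] \\ \mathbb{E}_n[x_{i,-1}] & 0 \end{bmatrix} + \begin{bmatrix} 0 & 0 \\ 0 & \mathbb{E}_n[x_{i,-1}x_{i,-1}'] \end{bmatrix} = M + N.$$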


We first bound $\phi_N(k)$. Letting $N_{-1,-1} = \mathbb{E}_n[x_{i,-1}x_{i,-1}']$ we have $\phi_N(k) = \phi_{N_{-1,-1}}(k)$. Lemma 12 implies that $\phi_N(k) \lesssim_P 1 + \sqrt{k/n}\,\sqrt{\log p}$. Lemma 13 bounds $\phi_N(k)$ from below because $\phi_{N_{-1,-1}}(k) \gtrsim_P \sqrt{(k/2n)\log(p-k)}$.

We then bound $\phi_M(k)$. Since $M_{11} = 1$, we have $\phi_M(1) \ge 1$. To produce an upper bound let $w = (a, b')'$ achieve $\phi_M(k)$ where $a \in \mathbb{R}$, $b \in \mathbb{R}^{p-1}$. By definition we have $\|w\| = 1$, $\|w\|_0 \le k$. Note that $|a| \le 1$, $\|b\| = \sqrt{1 - |a|^2} \le 1$, $\|b\|_1 \le \sqrt{k}\,\|b\|$. Therefore

$$\phi_M(k) = w'Mw = a^2 + 2ab'\mathbb{E}_n[x_{i,-1}] \le 1 + 2|b'\mathbb{E}_n[x_{i,-1}]|$$

$$\le 1 + 2\|b\|_1\|\mathbb{E}_n[x_{i,-1}]\|_\infty \le 1 + 2\sqrt{k}\,\|b\|\,\|\mathbb{E}_n[x_{i,-1}]\|_\infty.$$
Next we bound $\|\mathbb{E}_n[x_{i,-1}]\|_\infty = \max_{j=2,\dots,p}|\mathbb{E}_n[x_{ij}]|$. Since $\mathbb{E}_n[x_{ij}] \sim N(0, 1/n)$ for $j = 2,\dots,p$, by (4.13) we have $\|\mathbb{E}_n[x_{i,-1}]\|_\infty \lesssim_P \sqrt{(1/n)\log p}$. Therefore we have $\phi_M(k) \lesssim_P 1 + 2\sqrt{k/n}\,\sqrt{\log p}$.

Finally, we bound $\phi$. Note that $\phi(k) = \sup_{\alpha\in\mathbb{S}^k_p}\alpha'(M+N)\alpha = \sup_{\alpha\in\mathbb{S}^k_p}(\alpha'M\alpha + \alpha'N\alpha) \le \phi_M(k) + \phi_N(k)$. On the other hand, $\phi(k) \ge 1 \vee \phi_{N_{-1,-1}}(k)$ since the covariates contain an intercept. The result follows by using the bounds derived above.

The proof for the design of Example 2 is similar with the same steps. Since $-1 < \rho < 1$ is fixed, the bounds on the eigenvalues of the population design matrix $\Sigma$ needed to apply Lemmas 12 and 13 are given by $\sigma^2(k) = \sup_{\alpha\in\mathbb{S}^k_p}\alpha'\Sigma\alpha \le (1+|\rho|)/(1-|\rho|)$ for Lemma 12 and $\sigma^2(k) = \inf_{\alpha\in\mathbb{S}^k_p}\alpha'\Sigma\alpha \ge (1-|\rho|)/(1+|\rho|)$ for Lemma 13. To bound $\phi_M(k)$, the comparison theorem (4.1.5) allows for the same bound as for the uncorrelated design to hold. □

A. 4.  Maximal  Inequalities  for  a  Collection  of  Empirical  Processes.  The  main 
result  of  this  section  is  Lemma  18,  stating  a  maximal  inequality  that  controls  the  empirical 
process  uniformly  over  a  collection  of  classes  of  functions  using  class-dependent  bounds. 
We  need  this  lemma,  because  the  standard  maximal  inequalities  applied  to  the  union  of 
function  classes  yield  a  single  class-independent  bound  that  is  too  large  for  our  purposes. 

We  prove  the  main  result  by  first  stating  Lemma  15,  giving  a  bound  on  tail  probabilities 
of  a  separable  sub-Gaussian  process,  stated  in  terms  of  uniform  covering  numbers.  Here 
we  want  to  explicitly  trace  the  impact  of  covering  numbers  on  the  tail  probability,  since 
these covering numbers grow rapidly under increasing parameter dimension. Using the symmetrization approach, we then obtain Lemma 17, giving a bound on tail probabilities of a general separable empirical process, also stated in terms of uniform covering numbers. Finally, given a growth rate on the covering numbers, we obtain our final Lemma 18, which we repeatedly employ throughout the paper.

Lemma 15 (Exponential Inequality for Sub-Gaussian Process). Consider any linear zero-mean separable process $\{\mathbb{G}(f) : f \in \mathcal{F}\}$, whose index set $\mathcal{F}$ includes zero, is equipped with a $L_2(P)$ norm, and has envelope $F$. Suppose further that the process is sub-Gaussian, namely for each $g \in \mathcal{F} - \mathcal{F}$:

$$P\{|\mathbb{G}(g)| > \eta\} \le 2\exp\Big(-\tfrac{1}{2}\,\eta^2/(D^2\|g\|_{P,2}^2)\Big) \quad\text{for any } \eta > 0,$$

with $D$ a positive constant; and suppose that we have the following upper bound for the uniform $L_2$ covering numbers for $\mathcal{F}$:

$$\sup_Q N(\varepsilon\|F\|_{Q,2}, \mathcal{F}, L_2(Q)) \le n(\varepsilon, \mathcal{F}, L_2) \quad\text{for each } \varepsilon > 0,$$
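where $n(\varepsilon, \mathcal{F}, L_2)$ is increasing in $1/\varepsilon$, and $\varepsilon\sqrt{\log n(\varepsilon, \mathcal{F}, L_2)} \to 0$ as $1/\varepsilon \to \infty$ and is decreasing in $1/\varepsilon$. Then for $K > D$, for some universal constant $c < 30$, $\rho(\mathcal{F},P) := \sup_{f\in\mathcal{F}}\|f\|_{P,2}/\|F\|_{P,2}$,

$$P\Bigg\{\frac{\sup_{f\in\mathcal{F}}|\mathbb{G}(f)|}{\|F\|_{P,2}\int_0^{\rho(\mathcal{F},P)/4}\sqrt{\log n(x,\mathcal{F},L_2)}\,dx} > cK\Bigg\} \le \int_0^{\rho(\mathcal{F},P)/4}\varepsilon^{-1}\,n(\varepsilon,\mathcal{F},L_2)^{-\{(K/D)^2-1\}}\,d\varepsilon.$$
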

The result of Lemma 15 is similar in spirit to the result of Ledoux and Talagrand [23], page 302, on tail probabilities of a process stated in terms of Orlicz-norm covering numbers. However, Lemma 15 gives a tail probability stated in terms of the uniform $L_2$ covering numbers. The reason is that in our context estimates of the uniform $L_2$ covering numbers for common function classes are more readily available than estimates of the Orlicz-norm covering numbers.

In order to prove a bound on tail probabilities of a general separable empirical process, we need to go through a symmetrization argument. Since we use a data-dependent threshold, we need an appropriate extension of the classical symmetrization lemma to allow for this. Let us call a threshold function $x : \mathbb{R}^n \mapsto \mathbb{R}$ $k$-sub-exchangeable if for any $v,w \in \mathbb{R}^n$ and any vectors $\tilde v, \tilde w$ created by the pairwise exchange of the components in $v$ with components in $w$, we have that $x(\tilde v) \vee x(\tilde w) \ge [x(v) \vee x(w)]/k$. Several functions satisfy this property, in particular $x(v) = \|v\|$ with $k = \sqrt{2}$ and constant functions with $k = 1$. The following result generalizes the standard symmetrization lemma for probabilities (Lemma 2.3.7 of [38]) to the case of a random threshold $x$ that is sub-exchangeable.
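The claim that the Euclidean norm is $\sqrt{2}$-sub-exchangeable can be illustrated numerically. The sketch below enumerates every pairwise exchange of components between two vectors and records the smallest ratio $[x(\tilde v) \vee x(\tilde w)]/[x(v) \vee x(w)]$. The function name `worst_ratio` and the example vectors are illustrative choices; the vectors are picked so that the bound $1/\sqrt{2}$ is attained.

```python
import math
from itertools import product
from typing import List, Sequence

def norm(v: Sequence[float]) -> float:
    """Euclidean norm."""
    return math.sqrt(sum(c * c for c in v))

def worst_ratio(v: List[float], w: List[float]) -> float:
    """Smallest [x(v~) v x(w~)] / [x(v) v x(w)] over all pairwise exchanges, x = norm."""
    base = max(norm(v), norm(w))
    worst = float("inf")
    for swap in product([False, True], repeat=len(v)):
        v_t = [w[i] if s else v[i] for i, s in enumerate(swap)]
        w_t = [v[i] if s else w[i] for i, s in enumerate(swap)]
        worst = min(worst, max(norm(v_t), norm(w_t)) / base)
    return worst

v = [1.0, 1.0, 1.0, 1.0]
w = [0.0, 0.0, 0.0, 0.0]
r = worst_ratio(v, w)
```

Because an exchange preserves $\|\tilde v\|^2 + \|\tilde w\|^2 = \|v\|^2 + \|w\|^2$, the larger of the two exchanged norms is never below $1/\sqrt{2}$ times the larger original norm; splitting the mass evenly, as in this example, attains that bound.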

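Lemma 16 (Symmetrization with Data-dependent Threshold). Consider arbitrary independent stochastic processes $Z_1,\dots,Z_n$ and arbitrary functions $\mu_1,\dots,\mu_n : \mathcal{F} \mapsto \mathbb{R}$. Let $x(Z) = x(Z_1,\dots,Z_n)$ be a $k$-sub-exchangeable random variable and for any $\tau \in (0,1)$ let $q_\tau$ denote the $\tau$ quantile of $x(Z)$, $\bar p_\tau := P(x(Z) \le q_\tau) \ge \tau$, and $p_\tau := P(x(Z) < q_\tau) \le \tau$. We have

$$P\Big(\Big\|\sum_{i=1}^n Z_i\Big\|_{\mathcal{F}} > x_0 \vee x(Z)\Big) \le \frac{4}{\bar p_\tau}\,P\Big(\Big\|\sum_{i=1}^n \varepsilon_i(Z_i - \mu_i)\Big\|_{\mathcal{F}} > \frac{x_0 \vee x(Z)}{4k}\Big) + p_\tau,$$

where $x_0$ is a constant such that $\inf_{f\in\mathcal{F}}P\big(|\sum_{i=1}^n Z_i(f)| \le x_0/2\big) \ge 1 - \bar p_\tau/2$.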
Note that we can recover the classical symmetrization lemma where the threshold is fixed by setting $k = 1$, $\bar p_\tau = 1$, and $p_\tau = 0$. The next lemma follows from combining the previous two lemmas.

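Lemma 17 (Exponential inequality for separable empirical process). Consider a separable empirical process $\mathbb{G}_n(f) = n^{-1/2}\sum_{i=1}^n\{f(Z_i) - \mathrm{E}[f(Z_i)]\}$, where $Z_1,\dots,Z_n$ is an underlying i.i.d. data sequence. Let $K > 1$ and $\tau \in (0,1)$ be constants, and $e_n(\mathcal{F},P_n) = e_n(\mathcal{F},Z_1,\dots,Z_n)$ be a $k$-sub-exchangeable random variable, such that

$$\|F\|_{P_n,2}\int_0^{\rho(\mathcal{F},P_n)/4}\sqrt{\log n(\varepsilon,\mathcal{F},L_2)}\,d\varepsilon \le e_n(\mathcal{F},P_n) \quad\text{and}\quad \sup_{f\in\mathcal{F}}\mathrm{var}_P f \le \frac{\tau}{2}\big(4kcK\,e_n(\mathcal{F},P_n)\big)^2$$

for the same constant $c > 0$ as in Lemma 15, then
$$P\Big\{\sup_{f\in\mathcal{F}}|\mathbb{G}_n(f)| \ge 4kcK\,e_n(\mathcal{F},P_n)\Big\} \le \frac{4}{\tau}\,\mathrm{E}_P\Big(\Big[\int_0^{\rho(\mathcal{F},P_n)/4}\varepsilon^{-1}\,n(\varepsilon,\mathcal{F},L_2)^{-\{K^2-1\}}\,d\varepsilon\Big] \wedge 1\Big) + \tau.$$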

Finally, our main result in this section is as follows.

Lemma 18 (Maximal Inequality for a Collection of Empirical Processes). Consider a collection of separable empirical processes $\mathbb{G}_n(f) = n^{-1/2}\sum_{i=1}^n\{f(Z_i) - \mathrm{E}[f(Z_i)]\}$, where $Z_1,\dots,Z_n$ is an underlying i.i.d. data sequence, defined over function classes $\mathcal{F}_m$, $m = 1,\dots,n$, with envelopes $F_m = \sup_{f\in\mathcal{F}_m}|f(x)|$, $m = 1,\dots,n$, and with upper bounds on the uniform covering numbers of $\mathcal{F}_m$ given for all $m$ by

$$n(\varepsilon, \mathcal{F}_m, L_2) = (n\vee p)^m(\kappa/\varepsilon)^{\nu m}, \quad 0 < \varepsilon < 1,$$

with some constants $\kappa > 1$ and $\nu > 1$. For a constant $C := (1 + \sqrt{2\nu})/4$ set

$$e_n(\mathcal{F}_m, P_n) = C\sqrt{m\log(n\vee p)}\,\max\Big\{\sup_{f\in\mathcal{F}_m}\|f\|_{P,2},\ \sup_{f\in\mathcal{F}_m}\|f\|_{P_n,2}\Big\}.$$

Then, for any $\delta \in (0,1)$, there is a large enough constant $K \ge \sqrt{2/\delta}$ such that, for $n$ sufficiently large, the inequality

$$\sup_{f\in\mathcal{F}_m}|\mathbb{G}_n(f)| \le 4\sqrt{2}\,cK\,e_n(\mathcal{F}_m, P_n), \quad\text{for all } m \le n,$$
holds with probability at least $1 - \delta$, where the constant $c$ is the same as in Lemma 15.
Now we prove Lemmas 15, 16, 17, and 18.
Proof of Lemma 15. The proof follows by specializing arguments given in van der Vaart [36], page 286, to the sub-Gaussian processes and also tracing out the bounds on tail probabilities in full detail.

Step 1. There exists a sequence of nested partitions of $\mathcal{F}$, $\{(\mathcal{F}_{qi}, i = 1,\dots,N_q), q = q_0, q_0+1, \dots\}$, where the $q$-th partition consists of sets of $L_2(P)$ radius at most $\|F\|_{P,2}2^{-q}$, where $q_0$ is the smallest positive integer such that $2^{-q_0} \le \rho(\mathcal{F},P)/4$, so that $q_0 \ge 2$. The existence of such a partition follows from a standard argument, e.g. van der Vaart [36], page 286, which we repeat here: To construct the $q$-th partition, cover $\mathcal{F}$ with at most $n_q = n(2^{-q}, \mathcal{F}, L_2)$ balls of $L_2(P)$ radius $\|F\|_{P,2}2^{-q}$ and replace these by the same number of disjoint sets. If the sequence of partitions does not yet consist of successive refinements, then replace the partition at stage $q$ by the set of all intersections of the form $\cap_{q'=q_0}^{q}\mathcal{F}_{q',i_{q'}}$. This gives a partition into at most $N_q = n_{q_0}\cdots n_q$ sets, so that $\log N_q = \sum_{q'=q_0}^{q}\log n_{q'}$.

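Let $f_{qi}$ be an arbitrary point of $\mathcal{F}_{qi}$. Set $\pi_q(f) = f_{qi}$ if $f \in \mathcal{F}_{qi}$. By separability of the process, we can replace $\mathcal{F}$ by $\cup_{q,i}f_{qi}$, since the supremum norm of the process can be computed by taking this set only. In this case, we can decompose $f - \pi_{q_0}(f) = \sum_{q=q_0+1}^\infty(\pi_q(f) - \pi_{q-1}(f))$. Hence by linearity

$$\mathbb{G}(f) - \mathbb{G}(\pi_{q_0}(f)) = \sum_{q=q_0+1}^\infty\big[\mathbb{G}(\pi_q(f)) - \mathbb{G}(\pi_{q-1}(f))\big] = \sum_{q=q_0+1}^\infty\mathbb{G}(\pi_q(f) - \pi_{q-1}(f)),$$

and $|\mathbb{G}(f)| \le \sum_{q=q_0+1}^\infty\max_f|\mathbb{G}(\pi_q(f) - \pi_{q-1}(f))| + \max_f|\mathbb{G}(\pi_{q_0}(f))|$. Thus

$$P\Big\{\sup_{f\in\mathcal{F}}|\mathbb{G}(f)| > \sum_{q=q_0}^\infty\eta_q\Big\} \le \sum_{q=q_0+1}^\infty P\Big\{\max_f|\mathbb{G}(\pi_q(f) - \pi_{q-1}(f))| > \eta_q\Big\} + P\Big\{\max_f|\mathbb{G}(\pi_{q_0}(f))| > \eta_{q_0}\Big\}$$

for constants $\eta_q$ chosen below.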
Step 2. By construction of the partition sets

$$\|\pi_q(f) - \pi_{q-1}(f)\|_{P,2} \le 2\|F\|_{P,2}2^{-(q-1)} \le 4\|F\|_{P,2}2^{-q} \quad\text{for } q \ge q_0 + 1.$$

Setting $\eta_q = 8K\|F\|_{P,2}2^{-q}\sqrt{\log N_q}$, using sub-Gaussianity, setting $K > D$, using that $2\log N_q \ge \log N_qN_{q-1} \ge \log n_q$, using that $q \mapsto \log n_q$ is increasing in $q$, and $2^{-q_0} \le \rho(\mathcal{F},P)/4$, we obtain

$$\sum_{q=q_0+1}^\infty P\Big\{\max_f|\mathbb{G}(\pi_q(f) - \pi_{q-1}(f))| > \eta_q\Big\} \le \sum_{q=q_0+1}^\infty N_qN_{q-1}\,2\exp\Big(-\frac{\eta_q^2}{2(4D\|F\|_{P,2}2^{-q})^2}\Big)$$

$$\le \sum_{q=q_0+1}^\infty N_qN_{q-1}\,2\exp\big(-(K/D)^2\,2\log N_q\big)$$

$$\le \sum_{q=q_0+1}^\infty 2\exp\big(-\{(K/D)^2-1\}\log(N_qN_{q-1})\big)$$

$$\le \sum_{q=q_0+1}^\infty 2\exp\big(-\{(K/D)^2-1\}\log n_q\big)$$

$$\le \int_{q_0}^\infty 2\exp\big(-\{(K/D)^2-1\}\log n_q\big)\,dq$$

$$\le \int_0^{\rho(\mathcal{F},P)/4}(x\ln 2)^{-1}\,2\,n(x,\mathcal{F},L_2(P))^{-\{(K/D)^2-1\}}\,dx.$$


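By Jensen's inequality $\sqrt{\log N_q} \le a_q := \sum_{j=q_0}^{q}\sqrt{\log n_j}$, so that

$$\sum_{q=q_0+1}^\infty\eta_q \le 8\sum_{q=q_0+1}^\infty K\|F\|_{P,2}2^{-q}a_q.$$

Letting $b_q = 2\cdot2^{-q}$, noting $a_{q+1} - a_q = \sqrt{\log n_{q+1}}$ and $b_{q+1} - b_q = -2^{-q}$, we get using summation by parts

$$\sum_{q=q_0+1}^\infty 2^{-q}a_q = -\sum_{q=q_0+1}^\infty(b_{q+1} - b_q)a_q = -\Big(a_qb_q\Big|_{q_0+1}^{\infty} - \sum_{q=q_0+1}^\infty b_{q+1}(a_{q+1} - a_q)\Big)$$

$$= 2\cdot2^{-(q_0+1)}\sqrt{\log n_{q_0+1}} + \sum_{q=q_0+1}^\infty 2\cdot2^{-(q+1)}\sqrt{\log n_{q+1}} = 2\sum_{q=q_0+1}^\infty 2^{-q}\sqrt{\log n_q},$$

where we use the assumption that $2^{-q}\sqrt{\log n_q} \to 0$ as $q \to \infty$, so that $-a_qb_q|_{q_0+1}^{\infty} = 2\cdot2^{-(q_0+1)}\sqrt{\log n_{q_0+1}}$. Using that $2^{-q}\sqrt{\log n_q}$ is decreasing in $q$ by assumption,

$$2\sum_{q=q_0+1}^\infty 2^{-q}\sqrt{\log n_q} \le 2\int_{q_0}^\infty 2^{-q}\sqrt{\log n(2^{-q},\mathcal{F},L_2(P))}\,dq.$$

Using a change of variables and that $2^{-q_0} \le \rho(\mathcal{F},P)/4$, we finally conclude that

$$\sum_{q=q_0+1}^\infty\eta_q \le K\|F\|_{P,2}\,\frac{16}{\log 2}\int_0^{\rho(\mathcal{F},P)/4}\sqrt{\log n(x,\mathcal{F},L_2(P))}\,dx.$$
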
Step 3. Letting $\eta_{q_0} = K\|F\|_{P,2}\,\rho(\mathcal{F},P)\sqrt{2\log N_{q_0}}$, recalling that $N_{q_0} = n_{q_0}$, using $\|\pi_{q_0}(f)\|_{P,2} \le \rho(\mathcal{F},P)\|F\|_{P,2}$ and sub-Gaussianity, we conclude

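$$P\Big\{\max_f|\mathbb{G}(\pi_{q_0}(f))| > \eta_{q_0}\Big\} \le 2n_{q_0}\exp\big(-(K/D)^2\log n_{q_0}\big) \le 2\exp\big(-\{(K/D)^2-1\}\log n_{q_0}\big)$$

$$\le \int_{q_0-1}^{q_0}2\exp\big(-\{(K/D)^2-1\}\log n_q\big)\,dq = \int_{\rho(\mathcal{F},P)/4}^{\rho(\mathcal{F},P)/2}(x\ln 2)^{-1}\,2\,n(x,\mathcal{F},L_2(P))^{-\{(K/D)^2-1\}}\,dx.$$

Also, since $n_{q_0} = n(2^{-q_0},\mathcal{F},P)$, $2^{-q_0} \le \rho(\mathcal{F},P)/4$, and $n(x,\mathcal{F},P)$ is increasing in $1/x$, we obtain

$$\eta_{q_0} = 4K\|F\|_{P,2}\,[\rho(\mathcal{F},P)/4]\sqrt{2\log n(2^{-q_0},\mathcal{F},P)} \le 4\sqrt{2}\,K\|F\|_{P,2}\int_0^{\rho(\mathcal{F},P)/4}\sqrt{\log n(x,\mathcal{F},P)}\,dx.$$
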

Step 4. Finally, adding the bounds on tail probabilities from Steps 2 and 3 we obtain the tail bound stated in the main text. Further, adding bounds on $\eta_q$ from Steps 2 and 3, and using $c = 16/\log 2 + 4\sqrt{2} < 30$, we obtain

$$\sum_{q=q_0}^\infty\eta_q \le cK\|F\|_{P,2}\int_0^{\rho(\mathcal{F},P)/4}\sqrt{\log n(x,\mathcal{F},L_2(P))}\,dx.$$

□
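The numerical value of the constant in Step 4 is easy to confirm. The short sketch below evaluates $c = 16/\log 2 + 4\sqrt{2}$, assuming $\log$ denotes the natural logarithm (consistent with the factor $\ln 2$ arising from the change of variables $x = 2^{-q}$).

```python
import math

# c = 16 / log 2 + 4 * sqrt(2), with log taken as the natural logarithm
c = 16 / math.log(2) + 4 * math.sqrt(2)
```

The value is roughly 28.74, which is indeed below the bound of 30 stated in Lemma 15.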


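Proof of Lemma 16. The proof proceeds analogously to the proof of Lemma 2.3.7 (page 112) in [38] with the necessary adjustments. Letting $q_\tau$ be the $\tau$ quantile of $x(Z)$ we have

$$P\Big\{\Big\|\sum_{i=1}^n Z_i\Big\|_{\mathcal{F}} > x_0 \vee x(Z)\Big\} \le P\Big\{x(Z) \ge q_\tau,\ \Big\|\sum_{i=1}^n Z_i\Big\|_{\mathcal{F}} > x_0 \vee x(Z)\Big\} + P\{x(Z) < q_\tau\}.$$
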


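Next we bound the first term of the expression above. Let $Y = (Y_1,\dots,Y_n)$ be an independent copy of $Z = (Z_1,\dots,Z_n)$, suitably defined on a product space. Fix a realization of $Z$ such that $x(Z) \ge q_\tau$ and $\|\sum_{i=1}^{n} Z_i\|_{\mathcal{F}} > x_0 \vee x(Z)$. Therefore $\exists f_Z \in \mathcal{F}$ such that $|\sum_{i=1}^{n} Z_i(f_Z)| > x_0 \vee x(Z)$. Conditional on such $Z$ and using the triangular inequality we have that

\[
P_Y\Big\{x(Y) \le q_\tau,\ \Big|\sum_{i=1}^{n} Y_i(f_Z)\Big| \le \frac{x_0}{2}\Big\}
  \le P_Y\Big\{\Big|\sum_{i=1}^{n} (Y_i - Z_i)(f_Z)\Big| > \frac{x_0 \vee x(Z) \vee x(Y)}{2}\Big\}
  \le P_Y\Big\{\Big\|\sum_{i=1}^{n} (Y_i - Z_i)\Big\|_{\mathcal{F}} > \frac{x_0 \vee x(Z) \vee x(Y)}{2}\Big\} .
\]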

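By definition of $x_0$ we have $\inf_{f\in\mathcal{F}} P\{|\sum_{i=1}^{n} Z_i(f)| \le x_0/2\} \ge 1 - p_\tau/2$. Since $P_Y\{x(Y) \le q_\tau\} = p_\tau$, by Bonferroni inequality we have that the left hand side is bounded from below by $p_\tau - p_\tau/2 = p_\tau/2$. Therefore, over the set $\{Z : x(Z) \ge q_\tau,\ \|\sum_{i=1}^{n} Z_i\|_{\mathcal{F}} > x_0 \vee x(Z)\}$ we have

\[
\frac{p_\tau}{2} \le P_Y\Big\{\Big\|\sum_{i=1}^{n} (Y_i - Z_i)\Big\|_{\mathcal{F}} > \frac{x_0 \vee x(Z) \vee x(Y)}{2}\Big\} .
\]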
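Integrating over $Z$ we obtain

\[
\frac{p_\tau}{2}\, P\Big\{x(Z) \ge q_\tau,\ \Big\|\sum_{i=1}^{n} Z_i\Big\|_{\mathcal{F}} > x_0 \vee x(Z)\Big\}
  \le P_Z P_Y\Big\{\Big\|\sum_{i=1}^{n} (Y_i - Z_i)\Big\|_{\mathcal{F}} > \frac{x_0 \vee x(Z) \vee x(Y)}{2}\Big\} .
\]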


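Let $\varepsilon_1,\dots,\varepsilon_n$ be an independent sequence of Rademacher random variables. Given $\varepsilon_1,\dots,\varepsilon_n$, set $(\tilde Y_i = Y_i,\ \tilde Z_i = Z_i)$ if $\varepsilon_i = 1$ and $(\tilde Y_i = Z_i,\ \tilde Z_i = Y_i)$ if $\varepsilon_i = -1$. That is, we create vectors $\tilde Y$ and $\tilde Z$ by pairwise exchanging their components; by construction, conditional on each $\varepsilon_1,\dots,\varepsilon_n$, $(\tilde Y, \tilde Z)$ has the same distribution as $(Y, Z)$. Therefore,

\[
P_Z P_Y\Big\{\Big\|\sum_{i=1}^{n} (Y_i - Z_i)\Big\|_{\mathcal{F}} > \frac{x_0 \vee x(Z) \vee x(Y)}{2}\Big\}
  = E_\varepsilon P_Z P_Y\Big\{\Big\|\sum_{i=1}^{n} (\tilde Y_i - \tilde Z_i)\Big\|_{\mathcal{F}} > \frac{x_0 \vee x(\tilde Z) \vee x(\tilde Y)}{2}\Big\} .
\]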


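By $x(\cdot)$ being $k$-sub-exchangeable, and since $\varepsilon_i(Y_i - Z_i) = (\tilde Y_i - \tilde Z_i)$, we have that

\[
E_\varepsilon P_Z P_Y\Big\{\Big\|\sum_{i=1}^{n} (\tilde Y_i - \tilde Z_i)\Big\|_{\mathcal{F}} > \frac{x_0 \vee x(\tilde Z) \vee x(\tilde Y)}{2}\Big\}
  \le E_\varepsilon P_Z P_Y\Big\{\Big\|\sum_{i=1}^{n} \varepsilon_i (Y_i - Z_i)\Big\|_{\mathcal{F}} > \frac{x_0 \vee x(Z) \vee x(Y)}{2k}\Big\} .
\]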


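By the triangular inequality and removing $x(Y)$ or $x(Z)$, the latter is bounded by

\[
P\Big\{\Big\|\sum_{i=1}^{n} \varepsilon_i (Y_i - \mu_i)\Big\|_{\mathcal{F}} > \frac{x_0 \vee x(Y)}{4k}\Big\}
  + P\Big\{\Big\|\sum_{i=1}^{n} \varepsilon_i (Z_i - \mu_i)\Big\|_{\mathcal{F}} > \frac{x_0 \vee x(Z)}{4k}\Big\} .
\]

□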


Proof of Lemma 17. We would like to apply exponential inequalities to the general separable empirical process $\mathbb{G}_n(f) = n^{-1/2}\sum_{i=1}^{n}\{f(Z_i) - E[f(Z_i)]\}$, which is not sub-Gaussian; here $Z_1,\dots,Z_n$ is an underlying i.i.d. data sequence. To achieve this we use the standard symmetrization approach. Indeed, we first introduce the symmetrized process $\mathbb{G}_n^o(f) = n^{-1/2}\sum_{i=1}^{n}\{\varepsilon_i f(Z_i)\}$, where $\varepsilon_1,\dots,\varepsilon_n$ are i.i.d. Rademacher random variables, i.e., $P(\varepsilon_i = 1) = P(\varepsilon_i = -1) = 1/2$, which are independent of $Z_1,\dots,Z_n$. Then the tail probabilities of the general empirical process are bounded by the tail probabilities of the symmetrized process using the symmetrization lemma recalled below. Further, we know that by Hoeffding inequality the symmetrized process is sub-Gaussian conditional on $Z_1,\dots,Z_n$ with respect to the $L_2(\mathbb{P}_n)$ norm, where $\mathbb{P}_n$ is the empirical measure, and this delivers the result.
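
The conditional sub-Gaussianity of the symmetrized process can be illustrated numerically. The following is a minimal sketch, not part of the proof: it fixes one hypothetical vector of values $f(Z_1),\dots,f(Z_n)$ (drawn here from a standard normal purely for illustration), simulates the Rademacher signs, and compares the simulated tail of $|\mathbb{G}_n^o(f)|$ with the Hoeffding bound $2\exp(-x^2/(2\|f\|_{\mathbb{P}_n,2}^2))$. The function name `symmetrized_tail` and all parameter values are assumptions made for this example.

```python
import math
import random


def symmetrized_tail(values, x, reps=2000, seed=0):
    """Compare the simulated tail of |G_n^o(f)| with the Hoeffding bound.

    values: the fixed numbers f(Z_1), ..., f(Z_n) (we condition on the data).
    x: threshold for |G_n^o(f)|.
    Returns (simulated tail frequency, Hoeffding bound).
    """
    rng = random.Random(seed)
    n = len(values)
    norm_sq = sum(v * v for v in values) / n  # ||f||_{P_n,2}^2
    hits = 0
    for _ in range(reps):
        # G_n^o(f) = n^{-1/2} * sum_i eps_i f(Z_i) with Rademacher signs eps_i
        g = sum(rng.choice((-1.0, 1.0)) * v for v in values) / math.sqrt(n)
        if abs(g) > x:
            hits += 1
    empirical = hits / reps
    bound = 2.0 * math.exp(-x * x / (2.0 * norm_sq))
    return empirical, bound


# Hypothetical data: f(Z_i) drawn once and then held fixed.
data_rng = random.Random(1)
f_values = [data_rng.gauss(0.0, 1.0) for _ in range(200)]
norm = math.sqrt(sum(v * v for v in f_values) / len(f_values))
empirical, bound = symmetrized_tail(f_values, x=2.0 * norm)
print(f"simulated tail {empirical:.4f}, Hoeffding bound {bound:.4f}")
```

With the threshold set to twice the empirical $L_2$ norm, the bound equals $2e^{-2}$ regardless of the data, which makes the comparison easy to read.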

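By Chebyshev's inequality and the assumption on $e_n(\mathcal{F},\mathbb{P}_n)$ we have for the constant $\tau$ fixed in the statement of the lemma

\[
P\big(|\mathbb{G}_n(f)| > 4kcK e_n(\mathcal{F},\mathbb{P}_n)\big)
  \le \frac{\sup_{f\in\mathcal{F}} \mathrm{var}_P\,\mathbb{G}_n(f)}{(4kcK e_n(\mathcal{F},\mathbb{P}_n))^2}
  \le \frac{\sup_{f\in\mathcal{F}} \mathrm{var}_P f}{(4kcK e_n(\mathcal{F},\mathbb{P}_n))^2}
  \le \tau/2 .
\]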
Therefore, by the symmetrization Lemma 16 we obtain

\[
P\Big\{\sup_{f\in\mathcal{F}} |\mathbb{G}_n(f)| > 4kcK e_n(\mathcal{F},\mathbb{P}_n)\Big\}
  \le \frac{4}{\tau}\, P\Big\{\sup_{f\in\mathcal{F}} |\mathbb{G}_n^o(f)| > cK e_n(\mathcal{F},\mathbb{P}_n)\Big\} + \tau .
\]

We then condition on the values of $Z_1,\dots,Z_n$, denoting the conditional probability measure as $P_\varepsilon$. Conditional on $Z_1,\dots,Z_n$, by the Hoeffding inequality the symmetrized process $\mathbb{G}_n^o$ is sub-Gaussian for the $L_2(\mathbb{P}_n)$ norm, namely for $g \in \mathcal{F} - \mathcal{F}$

\[
P_\varepsilon\{\mathbb{G}_n^o(g) > x\} \le 2\exp\Big(-\frac{x^2}{2\|g\|_{\mathbb{P}_n,2}^2}\Big) .
\]


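Hence by Lemma 15 with $D = 1$, we can bound

\[
P_\varepsilon\Big\{\sup_{f\in\mathcal{F}} |\mathbb{G}_n^o(f)| > cK e_n(\mathcal{F},\mathbb{P}_n)\Big\}
  \le \int_0^{\rho(\mathcal{F},\mathbb{P}_n)/2} \epsilon^{-1}\, n(\epsilon,\mathcal{F},L_2)^{-\{K^2-1\}}\,d\epsilon .
\]

The result follows from taking the expectation over $Z_1,\dots,Z_n$. □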


Proof  of  Lemma  18.  The  proof  proceeds  in  two  steps,  with  the  first  step  containing 
the  main  argument  and  the  second  step  containing  some  auxiliary  calculations. 

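Step 1. In this step we prove the main result. First, we observe that the bound $\epsilon \mapsto n(\epsilon,\mathcal{F}_m,L_2)$ satisfies the monotonicity hypotheses of Lemma 17 uniformly in $m \le n$.

Second, recall that $e_n(\mathcal{F}_m,\mathbb{P}_n) := C\sqrt{m\log(n\vee p)}\,\max\{\sup_{f\in\mathcal{F}_m}\|f\|_{P,2},\ \sup_{f\in\mathcal{F}_m}\|f\|_{\mathbb{P}_n,2}\}$ for $C = (1+\sqrt{2v})/4$. Note that $\sup_{f\in\mathcal{F}_m}\|f\|_{\mathbb{P}_n,2}$ is $\sqrt{2}$-sub-exchangeable and $\rho(\mathcal{F}_m,\mathbb{P}_n) := \sup_{f\in\mathcal{F}_m}\|f\|_{\mathbb{P}_n,2}/\|F_m\|_{\mathbb{P}_n,2} \ge 1/\sqrt{n}$ by Step 2 below. Thus, uniformly in $m \le n$: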

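\[
\begin{aligned}
\|F_m\|_{\mathbb{P}_n,2}\int_0^{\rho(\mathcal{F}_m,\mathbb{P}_n)/4} \sqrt{\log n(\epsilon,\mathcal{F}_m,L_2)}\,d\epsilon
  &\le \|F_m\|_{\mathbb{P}_n,2}\int_0^{\rho(\mathcal{F}_m,\mathbb{P}_n)/4} \sqrt{m\log(n\vee p) + vm\log(K/\epsilon)}\,d\epsilon \\
  &\le (1/4)\sqrt{m\log(n\vee p)}\,\sup_{f\in\mathcal{F}_m}\|f\|_{\mathbb{P}_n,2} + \|F_m\|_{\mathbb{P}_n,2}\int_0^{\rho(\mathcal{F}_m,\mathbb{P}_n)/4} \sqrt{vm\log(K/\epsilon)}\,d\epsilon \\
  &\le \sqrt{m\log(n\vee p)}\,\sup_{f\in\mathcal{F}_m}\|f\|_{\mathbb{P}_n,2}\,(1+\sqrt{2v})/4 \le e_n(\mathcal{F}_m,\mathbb{P}_n),
\end{aligned}
\]

which follows by $\int_0^{\rho}\sqrt{\log(K/\epsilon)}\,d\epsilon \le \big(\int_0^{\rho} 1\,d\epsilon\big)^{1/2}\big(\int_0^{\rho}\log(K/\epsilon)\,d\epsilon\big)^{1/2} \le \rho\sqrt{2\log n}$, for $1/\sqrt{n} \le \rho \le 1$ and $K \le n$ for $n$ sufficiently large.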
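
The last integral bound can be checked by direct computation. The sketch below assumes only the stated conditions $1/\sqrt{n} \le \rho \le 1$ and $K \le n$; the explicit threshold $\log n \ge 2$ is our own reading of "$n$ sufficiently large" and is not taken from the source.

```latex
\int_0^{\rho} \log(K/\epsilon)\,d\epsilon
  = \rho\log K - \big[\epsilon\log\epsilon - \epsilon\big]_0^{\rho}
  = \rho\,\big(\log K + \log(1/\rho) + 1\big)
  \le \rho\,\big(\log n + \tfrac{1}{2}\log n + 1\big)
  \le 2\rho\log n \quad \text{whenever } \log n \ge 2,
\\[4pt]
\Big(\int_0^{\rho} 1\,d\epsilon\Big)^{1/2}\Big(\int_0^{\rho}\log(K/\epsilon)\,d\epsilon\Big)^{1/2}
  \le \big(\rho \cdot 2\rho\log n\big)^{1/2} = \rho\sqrt{2\log n} .
```

The first inequality uses $K \le n$ and $1/\rho \le \sqrt{n}$; the Cauchy-Schwarz step is the one quoted in the text.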

Third,  set  K  :=  ./2/5  >  1  so  that  B{K)  :=  (i^^  ._  i^  ^  2/5,  and  let  Tm  =  (5/(2mlog(n  V 
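p)). Recall that $4\sqrt{2}cC > 1$ where $1 < c < 30$ is defined in Lemma 15. Note that for any $m \le n$ and $f \in \mathcal{F}_m$, we have by Chebyshev inequality

\[
P\big(|\mathbb{G}_n(f)| > 4\sqrt{2}cK e_n(\mathcal{F}_m,\mathbb{P}_n)\big)
  \le \frac{\sup_{f\in\mathcal{F}_m}\|f\|_{P,2}^2}{(4\sqrt{2}cK e_n(\mathcal{F}_m,\mathbb{P}_n))^2}
  \le \frac{K^{-2}}{(4\sqrt{2}cC)^2\, m\log(n\vee p)}
  \le \tau_m/2 .
\]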
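Using Lemma 17 with our choice of $\tau_m$, $m \le n$, $k > 1$, $v > 1$, and $\rho(\mathcal{F}_m,\mathbb{P}_n) \le 1$, we obtain

\[
P\Big\{\sup_{f\in\mathcal{F}_m} |\mathbb{G}_n(f)| > 4\sqrt{2}cK e_n(\mathcal{F}_m,\mathbb{P}_n),\ \exists\, m \le n\Big\}
  \le \sum_{m=1}^{n} P\Big\{\sup_{f\in\mathcal{F}_m} |\mathbb{G}_n(f)| > 4\sqrt{2}cK e_n(\mathcal{F}_m,\mathbb{P}_n)\Big\}
\]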


^  ^E 


Tn=l 


^0 


,  ,   V  M  ■  ■       ■    .  •  2  log(n  Vp) 

by  our  choice  of  B{K)  and  n  sufficiently  large. 
Step  2.  In  this  step  we  perform  some  auxiliary  calculations. 

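To establish that $\sup_{f\in\mathcal{F}_m}\|f\|_{\mathbb{P}_n,2}$ is $\sqrt{2}$-sub-exchangeable, let $\tilde Z$, $\tilde Y$ be created by pairwise exchanging any components of $Z$ and $Y$. Then

\[
\begin{aligned}
\sqrt{2}\,\Big(\sup_{f\in\mathcal{F}_m}\|f\|_{\mathbb{P}_n(\tilde Z),2} \vee \sup_{f\in\mathcal{F}_m}\|f\|_{\mathbb{P}_n(\tilde Y),2}\Big)
  &\ge \Big\{\sup_{f\in\mathcal{F}_m}\|f\|_{\mathbb{P}_n(\tilde Z),2}^2 + \sup_{f\in\mathcal{F}_m}\|f\|_{\mathbb{P}_n(\tilde Y),2}^2\Big\}^{1/2} \\
  &\ge \Big\{\sup_{f\in\mathcal{F}_m} E_n[f(\tilde Z_i)^2] + E_n[f(\tilde Y_i)^2]\Big\}^{1/2}
   = \Big\{\sup_{f\in\mathcal{F}_m} E_n[f(Z_i)^2] + E_n[f(Y_i)^2]\Big\}^{1/2} \\
  &\ge \Big\{\sup_{f\in\mathcal{F}_m}\|f\|_{\mathbb{P}_n(Z),2}^2 \vee \sup_{f\in\mathcal{F}_m}\|f\|_{\mathbb{P}_n(Y),2}^2\Big\}^{1/2}
   = \sup_{f\in\mathcal{F}_m}\|f\|_{\mathbb{P}_n(Z),2} \vee \sup_{f\in\mathcal{F}_m}\|f\|_{\mathbb{P}_n(Y),2} .
\end{aligned}
\]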
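
The $\sqrt{2}$-sub-exchangeability inequality can be checked numerically on a small example. The sketch below is illustrative only: the finite class of functions, the Gaussian samples, and the name `check_sub_exchangeable` are all assumptions made for this example, standing in for $\mathcal{F}_m$ and the data. It swaps coordinates of $Z$ and $Y$ at random and verifies that $\sqrt{2}$ times the maximum of the two sup-norms after the swap is at least the maximum before the swap.

```python
import math
import random


def sup_norm(sample, funcs):
    """sup over f in the class of ||f||_{P_n,2} for the given sample."""
    n = len(sample)
    return max(math.sqrt(sum(f(z) ** 2 for z in sample) / n) for f in funcs)


def check_sub_exchangeable(n=50, trials=200, seed=0):
    """Return True if sqrt(2)*(x(Z~) v x(Y~)) >= x(Z) v x(Y) in every trial."""
    rng = random.Random(seed)
    # Hypothetical finite class standing in for F_m.
    funcs = [lambda z, j=j: math.cos(j * z) + 0.1 * j * z for j in range(1, 6)]
    for _ in range(trials):
        z = [rng.gauss(0.0, 1.0) for _ in range(n)]
        y = [rng.gauss(0.0, 2.0) for _ in range(n)]
        # Pairwise exchange: swap coordinate i of Z and Y with probability 1/2.
        swap = [rng.random() < 0.5 for _ in range(n)]
        z_t = [yi if s else zi for zi, yi, s in zip(z, y, swap)]
        y_t = [zi if s else yi for zi, yi, s in zip(z, y, swap)]
        lhs = math.sqrt(2.0) * max(sup_norm(z_t, funcs), sup_norm(y_t, funcs))
        rhs = max(sup_norm(z, funcs), sup_norm(y, funcs))
        if lhs < rhs - 1e-12:  # small tolerance for floating-point rounding
            return False
    return True


print(check_sub_exchangeable())
```

Because the inequality holds for every exchange pattern, the check should succeed for any seed; the random swaps only sample a few of the $2^n$ possible patterns.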

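Next we show that $\rho(\mathcal{F}_m,\mathbb{P}_n) := \sup_{f\in\mathcal{F}_m}\|f\|_{\mathbb{P}_n,2}/\|F_m\|_{\mathbb{P}_n,2} \ge 1/\sqrt{n}$ for $m \le n$. The latter follows from $E_n[F_m^2] = E_n[\sup_{f\in\mathcal{F}_m}|f(Z_i)|^2] \le \sup_{i\le n}\sup_{f\in\mathcal{F}_m}|f(Z_i)|^2$, and from $\sup_{f\in\mathcal{F}_m} E_n[|f(Z_i)|^2] \ge \sup_{f\in\mathcal{F}_m}\sup_{i\le n}|f(Z_i)|^2/n$. □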
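
The lower bound $\rho \ge 1/\sqrt{n}$ can also be seen on a toy example. The sketch below is an illustration under assumed inputs: `values[j][i]` plays the role of $f_j(Z_i)$ for a hypothetical finite class, the envelope is taken pointwise as $\sup_j |f_j(Z_i)|$, and the helper name `rho` is ours.

```python
import math
import random


def rho(values):
    """values[j][i] = f_j(Z_i); returns sup_j ||f_j||_{P_n,2} / ||F||_{P_n,2}."""
    n = len(values[0])
    sup_norm = max(math.sqrt(sum(v * v for v in row) / n) for row in values)
    # Envelope F(Z_i) = sup_j |f_j(Z_i)|, taken pointwise over the class.
    envelope = [max(abs(row[i]) for row in values) for i in range(n)]
    env_norm = math.sqrt(sum(e * e for e in envelope) / n)
    return sup_norm / env_norm


# Hypothetical class: 25 functions evaluated at 40 sample points.
rng = random.Random(0)
n, num_funcs = 40, 25
values = [[rng.gauss(0.0, 1.0) for _ in range(n)] for _ in range(num_funcs)]
r = rho(values)
print(1.0 / math.sqrt(n) <= r <= 1.0)
```

The upper bound $\rho \le 1$ holds because each $|f_j|$ is dominated by the envelope; the lower bound is the one derived in Step 2.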